Sarcasm Detection in Telugu Language
1) Problem Statement: The main aim of the project is to detect whether a given statement or sentence is sarcastic or not.
2) Dataset: The dataset consists of Telugu sentences collected from Telugu comedy shows, each annotated as sarcastic or non-sarcastic.
3) Data Preprocessing:
Data preprocessing consists of three steps: 3.1 Removal of Stop Words, 3.2 Removal of Punctuation Marks, and 3.3 Tokenization and POS Tagging.
3.1. Removal of Stop Words
Stop words are identified in the dataset and removed. Stop words are words that do not contribute to the identification or classification of the sentences. Some of the stop words were downloaded from the StanfordNLP Telugu stop words data.
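The stop word step can be sketched as a simple filter over whitespace-separated tokens. This is only a minimal illustration: the `STOP_WORDS` set below is a tiny hypothetical sample, not the list used in the project, and `remove_stop_words` is a name chosen here for illustration.

```python
# Hypothetical sample of Telugu stop words; a real list would be loaded from a file.
STOP_WORDS = {"మరియు", "ఈ", "ఆ", "కూడా"}


def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(tok for tok in sentence.split() if tok not in stop_words)


cleaned = remove_stop_words("ఈ సినిమా కూడా చాలా బాగుంది")
print(cleaned)
```

Keeping the stop words in a set makes each membership check constant time, which matters once the list grows to a few hundred entries.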
3.2. Removal of Punctuation Marks
Punctuation marks are removed from the sentences. Because this is a pattern-based approach, punctuation marks are not really necessary, so marks such as . / $ & * ( ) ! > : < ^ are removed.
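One way to strip punctuation is a translation table built from Python's `string.punctuation`. The sketch below assumes that removing all ASCII punctuation is acceptable; the exact set of marks removed in the project may differ, and `remove_punctuation` is an illustrative name.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(sentence):
    """Delete ASCII punctuation, then collapse any repeated whitespace."""
    return " ".join(sentence.translate(PUNCT_TABLE).split())


print(remove_punctuation("ఏంటి!  ఇది (నిజమా)?"))
```

Telugu letters and vowel signs are untouched because only the characters in `string.punctuation` are mapped to deletion.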
3.3. Tokenization and POS tagging
The sentences are tokenized into words, and each word is POS tagged. POS tagging is the process of assigning the correct part-of-speech tag, such as Noun, Verb, or Adverb, to each word of the given input sentence. POS taggers are developed by modelling the morphosyntactic structure of natural language. The POS tags are generated with the tagger developed by IIITH. As the tagger is trained on a large amount of data, it is expected to handle a large vocabulary and to predict the tags of unknown words using known words. Its developers followed an HMM-based approach and used the Indian language standard tagset, which comprises 21 tags, to build the tagger. The available Telugu tagger is based on the TnT tagger, which is well known for its robustness and speed.
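To give a feel for how an HMM-based tagger picks tags, here is a toy bigram HMM decoded with the Viterbi algorithm. It is not the IIITH tagger or TnT (TnT uses trigram transitions and suffix-based handling of unknown words). The tag names, the probabilities, and the fixed `unk_p` fallback for unseen words are all made-up assumptions for illustration.

```python
import math

# Toy model parameters (invented for illustration, not learned from data).
TAGS = ["NN", "VM"]
START_P = {"NN": 0.8, "VM": 0.2}
TRANS_P = {"NN": {"NN": 0.3, "VM": 0.7}, "VM": {"NN": 0.6, "VM": 0.4}}
EMIT_P = {"NN": {"రాము": 0.5}, "VM": {"వచ్చాడు": 0.5}}


def viterbi(tokens, tags, start_p, trans_p, emit_p, unk_p=1e-6):
    """Return the most likely tag sequence for a non-empty token list."""
    def emit(tag, word):
        # Unseen words get a small constant probability under every tag.
        return emit_p[tag].get(word, unk_p)

    # best[i][t] = (log-probability of the best path ending in t, previous tag)
    best = [{t: (math.log(start_p[t]) + math.log(emit(t, tokens[0])), None)
             for t in tags}]
    for word in tokens[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][t])) for p in tags),
                key=lambda item: item[1],
            )
            column[t] = (score + math.log(emit(t, word)), prev)
        best.append(column)

    # Backtrack from the best final tag to recover the full path.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


tokens = ["రాము", "వచ్చాడు"]
print(list(zip(tokens, viterbi(tokens, TAGS, START_P, TRANS_P, EMIT_P))))
```

Working in log-probabilities avoids numerical underflow on longer sentences, where the product of many small probabilities would otherwise round to zero.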
4. Results
After obtaining the POS tags and removing punctuation and stop words, the sentences are vectorized using a TF-IDF vectorizer.
These vectors are fed to three models: Support Vector Machine (SVM), Random Forest, and Multilayer Perceptron (MLP).
The data is split into 80 percent for training and 20 percent for testing. Three feature configurations are used with these models (4.1 to 4.3), followed by a word-embedding approach (4.4):
4.1. Using only sentences
Here the sentences alone are provided to the models to determine whether each one is sarcastic or non-sarcastic. The accuracies obtained are SVM: 74.5%, Random Forest: 83.6%, MLP: 76.2%.
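The sentence-only setup (TF-IDF vectors, an 80/20 split, and the three classifiers) could look roughly like this with scikit-learn. The post does not name the library or the hyperparameters, so scikit-learn, the default model settings, the stratified split, and the `evaluate_models` helper are all assumptions of this sketch.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC


def evaluate_models(sentences, labels, seed=0):
    """Return {model name: test accuracy} using an 80/20 train/test split."""
    train_x, test_x, train_y, test_y = train_test_split(
        sentences, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    # Tokenize on whitespace: the default token pattern can split
    # Indic-script words at combining vowel signs.
    vectorizer = TfidfVectorizer(token_pattern=r"\S+")
    train_vec = vectorizer.fit_transform(train_x)  # fit on training data only
    test_vec = vectorizer.transform(test_x)

    models = {
        "SVM": SVC(),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "MLP": MLPClassifier(max_iter=500, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(train_vec, train_y)
        scores[name] = accuracy_score(test_y, model.predict(test_vec))
    return scores
```

Calling `evaluate_models(sentences, labels)` with a list of preprocessed sentences and a parallel list of 0/1 labels returns one accuracy per model. Fitting the vectorizer on the training portion only keeps test-set vocabulary from leaking into the features.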
4.2. Using POS tags along with the sentences
Here the sentences along with their POS tags are used to train the models. The accuracies obtained are SVM: 74.9%, Random Forest: 86.5%, MLP: 86.9%.
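The post does not say how the POS tags are combined with the sentences. One plausible option, shown here purely as an assumption, is to append the tags to the words as extra tokens so that a TF-IDF vectorizer sees both. The `POS_` prefix and the `add_pos_features` name are hypothetical.

```python
def add_pos_features(tagged_sentence):
    """Turn [(word, tag), ...] into one string: the words, then prefixed tags."""
    words = [word for word, _ in tagged_sentence]
    # The prefix keeps tag tokens from colliding with ordinary words.
    tags = ["POS_" + tag for _, tag in tagged_sentence]
    return " ".join(words + tags)


print(add_pos_features([("రాము", "NN"), ("వచ్చాడు", "VM")]))
```

The resulting strings can be vectorized exactly like plain sentences, so the rest of the pipeline stays unchanged.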
4.3. Using tri-grams with POS tags and sentences
Here the sentences and POS tags are split into tri-grams and fed to the models. The accuracies obtained are SVM: 73.5%, Random Forest: 86.9%, MLP: 74.9%.
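A tri-gram is a window of three consecutive items. As a small sketch (the `trigrams` helper is an illustrative name, and whether the project builds tri-grams over words, tags, or both together is not stated), the windows can be generated like this:

```python
def trigrams(items):
    """Return every run of three consecutive items as a tuple."""
    return [tuple(items[i:i + 3]) for i in range(len(items) - 2)]


print(trigrams(["NN", "JJ", "NN", "VM"]))
```

Sequences shorter than three items yield no tri-grams. If scikit-learn is used for vectorization, `TfidfVectorizer(ngram_range=(3, 3))` is an alternative that builds word tri-grams internally.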
4.4. Using Word Embeddings and RNN, LSTM, GRU
Word embeddings are generated using an embedding layer and used to train RNN, LSTM, and GRU models, which achieve accuracies of 89%, 89.5%, and 90.25% respectively.
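As a rough sketch of the embedding-plus-recurrent setup, the following PyTorch module feeds token ids through an embedding layer and a GRU, then maps the final hidden state to a single logit. The framework, the layer sizes, and the `GRUClassifier` name are assumptions; the post does not give the architecture details.

```python
import torch
from torch import nn


class GRUClassifier(nn.Module):
    """Embedding layer -> GRU -> linear layer producing one logit per sentence."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=32):
        super().__init__()
        # Index 0 is reserved for padding and always maps to a zero vector.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)          # (batch, seq, embed_dim)
        _, last_hidden = self.gru(embedded)           # (1, batch, hidden_dim)
        return self.out(last_hidden[-1]).squeeze(-1)  # (batch,) raw logits


model = GRUClassifier(vocab_size=100)
batch = torch.randint(1, 100, (4, 7))  # 4 sentences of 7 token ids each
print(model(batch).shape)
```

The logits would typically be trained with `nn.BCEWithLogitsLoss` against 0/1 sarcasm labels. `nn.RNN` can replace `nn.GRU` directly, while `nn.LSTM` returns its final state as a `(hidden, cell)` pair and needs `forward` adjusted accordingly.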

