Sarcasm Detection in Telugu Language

1) Problem Statement: The main aim of the project is to detect whether a given statement or sentence is sarcastic or not.

2) Dataset: The dataset consists of Telugu sentences collected from Telugu comedy shows, each annotated as sarcastic or non-sarcastic.

3) Data Preprocessing:
Data preprocessing consists of three steps: 3.1 Removal of Stop Words, 3.2 Removal of Punctuation Marks, and 3.3 Tokenization and POS Tagging.

3.1 Removal of Stop Words:
Stop words are identified in the dataset and removed. Stop words are words that do not contribute to the identification or classification of the sentences. Some of the stop words are taken from the StanfordNLP Telugu stop words data.
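
The stop word removal step can be sketched as below. This is a minimal illustration, assuming whitespace-separated tokens; the four stop words shown are illustrative placeholders, not the project's actual list, and the function name `remove_stop_words` is hypothetical.

```python
# Illustrative stop words only (assumption); the project's real list is not shown here.
STOP_WORDS = {"ఈ", "ఒక", "మరియు", "కూడా"}


def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(tok for tok in sentence.split() if tok not in stop_words)


cleaned = remove_stop_words("ఈ సినిమా చాలా బాగుంది")
print(cleaned)
```

Using a set for the stop words keeps each membership check constant-time, which matters once the list grows beyond a handful of entries.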


3.2 Removal of Punctuation Marks:
Punctuation marks are removed from the sentences. Because this is a pattern-based approach, punctuation marks are not needed, so marks such as ”./$&*()!¿:¡ˆ” are removed.
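
A minimal sketch of punctuation removal follows. The exact character set used in the project is not fully recoverable from the text, so the `PUNCTUATION` string below is an assumption, as is the helper name `remove_punctuation`.

```python
# Assumed punctuation set; adjust to match the characters actually removed.
PUNCTUATION = './$&*()!:^"'
_DELETE_TABLE = str.maketrans("", "", PUNCTUATION)


def remove_punctuation(sentence):
    """Delete the listed punctuation characters and collapse leftover whitespace."""
    stripped = sentence.translate(_DELETE_TABLE)
    return " ".join(stripped.split())


print(remove_punctuation("చాలా బాగుంది!"))
```

Because `str.translate` only deletes the listed code points, Telugu vowel signs and other combining marks are left untouched.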

3.3 Tokenization and POS Tagging:
The words are tokenized and POS tagged. POS tagging is the process of assigning the correct part-of-speech tag, such as noun, verb, or adverb, to each word of the input sentence. POS taggers are developed by modelling the morphosyntactic structure of natural language. The POS tags are generated by the tagger developed by IIITH. As the tagger is trained on a large amount of data, it is expected to handle a large vocabulary and to predict the tags of unknown words using known words. Its developers followed an HMM-based approach and used the Indian language standard tagset, which comprises 21 tags, to build the tagger. The available Telugu tagger is based on the TnT tagger, which is well known for its robustness and speed.
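
To illustrate the idea behind HMM-based tagging, here is a toy first-order (bigram) Viterbi decoder. It is not the IIITH tagger: TnT uses a trigram model with its own smoothing and unknown-word handling, and every probability below is invented for illustration. Only two tags (`NN`, `VM`) are used, and unknown words get a small constant emission probability `unk`, which is an assumption of this sketch.

```python
def viterbi(words, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence for words under a bigram HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], unk), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, unk), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


# Toy model with made-up probabilities (assumption, for illustration only).
TAGS = ["NN", "VM"]
START_P = {"NN": 0.7, "VM": 0.3}
TRANS_P = {"NN": {"NN": 0.4, "VM": 0.6}, "VM": {"NN": 0.5, "VM": 0.5}}
EMIT_P = {"NN": {"సినిమా": 0.9}, "VM": {"బాగుంది": 0.9}}

predicted = viterbi(["సినిమా", "బాగుంది"], TAGS, START_P, TRANS_P, EMIT_P)
print(predicted)
```

A real tagger would estimate these tables from an annotated corpus and work in log space to avoid underflow on long sentences.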


4) Results:
After the POS tags are obtained and punctuation and stop words are removed, the sentences are vectorized using a TF-IDF vectorizer. These vectors are fed to three models: Support Vector Machine (SVM), Random Forest, and Multilayer Perceptron (MLP). The data is split into 80 percent for training and 20 percent for testing. Three feature configurations are used with these models (4.1 to 4.3), and a fourth experiment (4.4) uses word embeddings with recurrent networks:

4.1 Using only sentences:
Here the sentences alone are provided to the models to determine whether each is sarcastic or non-sarcastic. The accuracies obtained (in percent) were SVM: 74.5, Random Forest: 83.6, and MLP: 76.2.
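
The TF-IDF vectorization and 80/20 split described in this section can be sketched in plain Python as follows. The report does not name the library or the exact TF-IDF variant, so this sketch assumes tf = count / sentence length and idf = ln(N / df); library implementations typically add smoothing and normalization, so their numbers will differ. The helper names are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using tf = count / length and idf = ln(N / df)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many sentences each token occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        length = len(toks) or 1  # avoid division by zero for empty sentences
        vectors.append([counts[tok] / length * idf[tok] for tok in vocab])
    return vocab, vectors


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


vocab, vectors = tfidf_vectors(["a b", "a c"])
train, test = train_test_split(range(10))
print(vocab, len(train), len(test))
```

The resulting vectors and labels would then be passed to the classifiers; a fixed seed keeps the split reproducible between runs.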

4.2 Using POS tags along with the sentences:
Here the sentences along with their POS tags are used to train the models. The accuracies obtained (in percent) were SVM: 74.9, Random Forest: 86.5, and MLP: 86.9.
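
The report does not say how the POS tags are combined with the sentences. One possible encoding, shown here purely as an assumption, appends the tag sequence to the token sequence so that both words and tags become features for the vectorizer. The function name `combine_with_tags` is hypothetical.

```python
def combine_with_tags(tokens, tags):
    """Return one string containing the tokens followed by their POS tags."""
    if len(tokens) != len(tags):
        raise ValueError("each token needs exactly one tag")
    return " ".join(list(tokens) + list(tags))


combined = combine_with_tags(["సినిమా", "బాగుంది"], ["NN", "VM"])
print(combined)
```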

4.3 Using tri-grams with POS tags and sentences:
Here the sentences and POS tags are split into tri-grams and fed to the models. The accuracies obtained (in percent) were SVM: 73.5, Random Forest: 86.9, and MLP: 74.9.
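
Splitting a token sequence into tri-grams can be sketched as below. The helper `ngrams` is a hypothetical name; the same function applies to word tokens and to POS tag sequences.

```python
def ngrams(tokens, n=3):
    """Return all contiguous n-token windows; empty if the sequence is shorter than n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


trigrams = ngrams(["NN", "RB", "JJ", "VM"])
print(trigrams)
```

Sentences with fewer than three tokens yield no tri-grams under this definition, so a real pipeline would need to decide how to handle them (for example by padding).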

4.4 Using word embeddings with RNN, LSTM, and GRU:
Word embeddings are generated using an embedding layer and used to train RNN, LSTM, and GRU models, which achieved accuracies of 89, 89.5, and 90.25 percent respectively.
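
An embedding layer takes integer token ids as input, so sentences are usually converted to fixed-length id sequences first. The sketch below shows that preparation step only; the framework, vocabulary handling, and sequence length used in the project are not stated, so the reserved ids (0 for padding, 1 for unknown words) and the helper names are assumptions.

```python
def build_vocab(sentences):
    """Map each distinct token to an integer id; 0 is padding, 1 is unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for sentence in sentences:
        for tok in sentence.split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(sentence, vocab, max_len):
    """Convert a sentence to exactly max_len ids, truncating or padding on the right."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in sentence.split()][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


vocab_ids = build_vocab(["a b", "b c"])
print(encode("a c d", vocab_ids, 5))
```

The encoded sequences would then be fed to the embedding layer, whose output is consumed by the recurrent model.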



