This blog post is about the presentation I made for my dissertation in the NLP/ML domain. Sentiment analysis is easy for English. It is a little more difficult for other languages such as Hindi or Marathi. It gets harder when those languages are written in Roman script, and hardest of all when that transliterated text is a mixture of multiple languages. In our case we consider two combinations: English-Hindi and English-Marathi. This mixture of languages is called code-mix; usually a single script (typically Roman) is used for the textual representation. How do you analyse sentiment in this situation?
Ordinary Sentiment Analysis
The steps involved in the usual analysis are as follows:
- Normalize the word tokens
- Perform POS tagging to label each word as a noun, adjective, adverb, or verb
- Use WordNet to get the synset identifier, and SentiWordNet to get the sentiment scores associated with those words, together with the tags found in the POS analysis
- Convert the words into features:
  - Remove stop words
  - Drop the senti-scores of everything except adjectives and adverbs, since those are the words that matter most where sentiment is concerned
  - Create n-gram features to tie each word to its context
- Feed the features to a classifier for training, then test on held-out data to calculate accuracy
- Use cross-validation to avoid overtraining
- Find a threshold that balances overtraining against the generality needed to analyse unseen data
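The filtering and scoring steps above can be sketched in a few lines. The tiny POS and sentiment lexicons below are toy stand-ins for a real POS tagger and SentiWordNet (in practice you would use something like NLTK for both); the words and scores are made up for illustration.

```python
# Toy POS lexicon: a stand-in for a trained POS tagger.
POS = {"movie": "noun", "was": "verb", "really": "adverb", "great": "adjective",
       "plot": "noun", "terribly": "adverb", "boring": "adjective"}

# Toy sentiment lexicon: word -> (positive, negative), SentiWordNet-style.
SENTI = {"great": (0.75, 0.0), "boring": (0.0, 0.625),
         "really": (0.25, 0.0), "terribly": (0.0, 0.25)}

STOPWORDS = {"the", "was", "a"}

def features(sentence):
    tokens = [t.lower() for t in sentence.split()]        # normalize
    tokens = [t for t in tokens if t not in STOPWORDS]    # drop stop words
    # Keep only adjectives and adverbs, the main sentiment carriers.
    kept = [t for t in tokens if POS.get(t) in ("adjective", "adverb")]
    bigrams = list(zip(kept, kept[1:]))                   # n-gram context
    # Net sentiment: sum of (positive - negative) scores.
    score = sum(SENTI.get(t, (0, 0))[0] - SENTI.get(t, (0, 0))[1] for t in kept)
    return kept, bigrams, score

kept, bigrams, score = features("The movie was really great")
# kept == ["really", "great"], score == 1.0
```

The `kept` tokens and bigrams would become the classifier's features; the score alone already gives a crude polarity signal.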
Problems with the generic approach for code-mixed text
Although the steps are more involved and I've skipped a few, the general approach is covered above. The problem is that transliterated code-mixed text simply doesn't work with these steps. Why? Because the transliterated text has no fixed grammar, and the words are not spelled consistently. The sentence construction does not follow the rules of any single language. If the chunks of each language are taken out, in many cases they don't form grammatically correct syntax in their respective languages; this happens because the rules of the languages are mixed together. An example is in order: "Maine tujhe good luck tere exams ke liye nahi bola because tumse na hoga" (roughly: "I didn't wish you good luck for your exams because you won't manage it"). In this sentence, the English words follow the contextual grammar of Hindi, so their POS tags and the associated sentiments are highly ambiguous. This leads to the conclusion that code-mixed text requires a different approach, and that is exactly what my dissertation is about.
Since a figure speaks a thousand words, I'll let the following figures illustrate the approach I took for sentiment analysis.
The figure above shows all the important steps needed for this approach; they are listed again below for those who would rather not spend time on the figure.
- Check whether a token is an emoticon or a known slang term like lol, and replace it with the appropriate words: 😊 becomes smiling and lol becomes laughing out loud.
- Transliterate the remaining words into the Hindi/Marathi script (Devanagari) and look them up in a Hindi/Marathi dictionary, checking for spelling variations. If found, tag them as such.
- Look up all the words in an English dictionary and tag them as such. If a word matches both languages, use word-frequency probability to break the tie.
- Look up the sentiment scores in the SentiWordNet of each language (Hindi/English) for each token, based on its language tag. Marathi presently has no SentiWordNet, so for Marathi words use a bilingual dictionary to get their Hindi and English meanings, follow the process for those languages, and merge the scores.
- Finally, convert the words into n-gram features with their corresponding sentiment scores, and cross-validate on training and test data to measure accuracy.
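The slang replacement and language-tagging steps can be sketched as follows. The dictionaries and frequency table here are tiny hypothetical stand-ins for real romanized Hindi and English word lists and corpus frequencies; the real system also transliterates to Devanagari and handles spelling variants, which this sketch skips.

```python
# Toy stand-ins for real lexicons (assumptions, not real resources).
SLANG = {"lol": "laughing out loud", ":)": "smiling"}
HINDI = {"maine", "tujhe", "nahi", "bola"}            # romanized Hindi words
ENGLISH = {"good", "luck", "exams", "bola"}           # "bola" matches both (tie)
FREQ = {("bola", "hi"): 0.9, ("bola", "en"): 0.1}     # tie-break probabilities

def tag(tokens):
    """Tag each token with a language label: 'hi', 'en', or 'unk'."""
    tagged = []
    for tok in tokens:
        if tok in SLANG:                    # expand slang/emoticons first
            tagged.append((SLANG[tok], "en"))
            continue
        in_hi, in_en = tok in HINDI, tok in ENGLISH
        if in_hi and in_en:                 # tie: break by word frequency
            lang = "hi" if FREQ.get((tok, "hi"), 0) >= FREQ.get((tok, "en"), 0) else "en"
        elif in_hi:
            lang = "hi"
        elif in_en:
            lang = "en"
        else:
            lang = "unk"                    # candidate for spelling-variant lookup
        tagged.append((tok, lang))
    return tagged
```

Once every token carries a language tag, the sentiment lookup can be routed to the right SentiWordNet (or, for Marathi, through the bilingual dictionary).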
These steps are a lot more complicated to implement. After implementing them in Python with scikit-learn, I got the results shown in the following graphs.
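For the final step, the n-gram features and cross-validation look roughly like this in scikit-learn. The eight labelled comments below are made-up stand-ins for the real YouTube data, and logistic regression is just one reasonable classifier choice; the post does not say which classifier was actually used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical code-mixed comments standing in for the real dataset.
comments = ["movie accha tha", "bahut boring film", "great story yaar",
            "time waste nahi karo", "kya mast gaana", "bakwas plot hai",
            "superb acting thi", "worst movie ever"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]   # 1 = positive, 0 = negative

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # unigram + bigram features
    LogisticRegression())

# 4-fold cross-validation: accuracy on each held-out fold.
scores = cross_val_score(model, comments, labels, cv=4)
print(scores.mean())
```

On real data you would also append the per-token sentiment scores to the feature vectors rather than relying on n-gram counts alone.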
The worst part is that even these steps do not work as well as sentiment analysis for a single language. The best I could do was just above 50% accuracy for the Hindi-English combination and 80% for Marathi-English. Now, 80% sounds good, but there is a caveat. The Marathi-English data is skewed by the presence of fixed words. The data came from YouTube comments, and certain words in both the positive and negative groups turned out to be near-absolute indicators of sentiment polarity. This makes the system look like it works when in truth it really isn't working. I'm not proud of the results, but I'm glad I learned a lot from this. Until next time.