Abstract
Since the inception of Web 2.0, the virtual data landscape has witnessed an unprecedented flow of user-generated text. With applications ranging from product analysis and business strategy to incorporating public opinion into governance processes, sentiment analysis has become one of the focal points of scientific exploration. Sentiment analysis deciphers subjective opinions expressed on online platforms using natural language processing methods together with statistical and machine learning techniques.
While English has seen extensive research in this domain, performing the same task for vernacular languages remains a formidable challenge. Although significant strides have been made for Indian languages in recent years, progress on sentiment analysis for Odia has been negligible, owing to the unavailability of the essential tools and resources needed to identify subjective information in Odia text. This thesis aims to bridge this gap by building such tools and resources. A sentiment lexicon is a valuable tool for capturing subjective information at the sentence level. In this thesis, we propose two distinct approaches to building such a lexicon for resource-poor Indian languages such as Odia. The first is a translation-based approach that uses available English resources to create a source lexicon; the target lexicon is then created using an appropriate machine translation system.
Because this approach requires a machine translation system between English and the target language, the unavailability of such a system for Odia (and a few other resource-poor Indian languages) is a methodological drawback. To overcome it, we propose a synset-based approach to creating the Odia sentiment lexicon. This approach makes use of a linked WordNet structure along with the sentiment lexicons available for the Indian languages used.
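The synset-based idea can be illustrated with a minimal sketch, assuming a linked WordNet is available as a mapping from synset IDs to the words of each language, and that source-language lexicons map words to polarity labels. The function name `project_lexicon`, the data layout, the majority-vote rule, and the romanized toy entries are all hypothetical and are not taken from the thesis resources.

```python
from collections import Counter, defaultdict


def project_lexicon(linked_synsets, source_lexicons, target_lang):
    """Project polarity labels onto target-language words via shared synsets.

    linked_synsets:  {synset_id: {lang_code: [words]}}   (assumed layout)
    source_lexicons: {lang_code: {word: polarity}}       (assumed layout)
    """
    votes = defaultdict(Counter)
    for synset in linked_synsets.values():
        target_words = synset.get(target_lang, [])
        if not target_words:
            continue
        for lang, lexicon in source_lexicons.items():
            for word in synset.get(lang, []):
                polarity = lexicon.get(word)
                if polarity is None:
                    continue
                # Every target word in the synset inherits one vote.
                for target_word in target_words:
                    votes[target_word][polarity] += 1

    projected = {}
    for word, counter in votes.items():
        ranked = counter.most_common(2)
        # Keep a word only when one polarity wins outright (assumed rule).
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            projected[word] = ranked[0][0]
    return projected


# Toy data with illustrative romanized placeholders.
linked_synsets = {
    "s1": {"hi": ["khushi"], "bn": ["anondo"], "or": ["khusi", "ananda"]},
    "s2": {"hi": ["dukh"], "bn": ["dukkho"], "or": ["dukha"]},
    "s3": {"hi": ["mez"], "or": ["tebul"]},  # no sentiment entry in any source
}
source_lexicons = {
    "hi": {"khushi": "positive", "dukh": "negative"},
    "bn": {"anondo": "positive", "dukkho": "negative"},
}

odia_lexicon = project_lexicon(linked_synsets, source_lexicons, "or")
```

In this sketch, a target word receives a label only if at least one linked source word carries one, so words in synsets without sentiment entries are simply left out of the projected lexicon.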
To address the lack of sentiment-annotated corpora, we have built two datasets from different domains. First, we have created a corpus of 730 Odia poems annotated with sentiment polarity. An annotation scheme comprising a polarity identification questionnaire combined with a taxonomy of emotions has been adopted to label each poem as positive or negative at the document level. To establish a strong baseline for the corpus, several experiments have been conducted. We employed several machine learning techniques with word-level features for the initial classification task. Character-level features showed consistent improvements over this word-level baseline. By creating a word embedding model for Odia, we have also used word vector representations as features for classifying Odia poems, with results comparable to those of character-level features.
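The difference between word-level and character-level features can be shown with a small sketch. The function names, the n-gram range of 2 to 3, and the romanized placeholder strings are assumptions made for illustration; the abstract does not specify the feature settings actually used.

```python
from collections import Counter


def word_features(text):
    """Bag-of-words counts over whitespace-separated tokens."""
    return Counter(text.split())


def char_ngram_features(text, n_min=2, n_max=3):
    """Counts of all character n-grams with n_min <= n <= n_max."""
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            feats[text[i:i + n]] += 1
    return feats


# Two hypothetical surface forms that share a stem.
form_a, form_b = "bhala", "bhalare"

shared_words = set(word_features(form_a)) & set(word_features(form_b))
shared_ngrams = set(char_ngram_features(form_a)) & set(char_ngram_features(form_b))
```

Here the two forms share no word-level feature but overlap in several character n-grams, which is one plausible reason character-level features can help in a morphologically rich language with limited training data.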
In the news domain, the thesis also presents a sentiment-annotated corpus of 2045 Odia sentences. For sentence-level annotation, guidelines are discussed for categorizing each sentence as positive, negative or neutral. Baseline results for the corpus have been established with machine learning techniques, using both word-level and character-level features, for binary and ternary classification. Among these classifiers, Support Vector Machines and Logistic Regression show comparable results. Furthermore, to assess the reliability and utility of the Odia sentiment lexicon, affective words in each sentence have been identified and used as features for classification. Including these features yields consistent improvements for both binary and ternary classification across all evaluation metrics. This attests to the usability of the created Odia sentiment lexicon for sentence-level sentiment classification.
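One simple way affective words could be turned into sentence-level features is sketched below. The feature names, the count-based encoding, and the toy lexicon are hypothetical; the abstract does not state how the lexicon features were actually encoded.

```python
def lexicon_features(tokens, lexicon):
    """Count lexicon hits per polarity for one tokenized sentence.

    lexicon: {word: "positive" | "negative"}   (assumed layout)
    """
    pos = sum(1 for tok in tokens if lexicon.get(tok) == "positive")
    neg = sum(1 for tok in tokens if lexicon.get(tok) == "negative")
    return {"lex_pos": pos, "lex_neg": neg, "lex_diff": pos - neg}


# Toy lexicon and sentence using romanized placeholder tokens.
toy_lexicon = {"khusi": "positive", "ananda": "positive", "dukha": "negative"}
sentence = ["se", "khusi", "kintu", "dukha", "ananda"]

features = lexicon_features(sentence, toy_lexicon)
```

Such counts would typically be appended to the word-level or character-level feature vector before training a classifier such as a Support Vector Machine or Logistic Regression model.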