Prompt-Based LLM Analysis of Policy Agenda Evolution in U.S. Presidential Debates (1960–2024)
Lavanya Sunkeswari Prahallad,Radhika Mamidi
International Conference on Federated Learning and Intelligent Computing Systems, FLICS, 2026
@inproceedings{bib_Prom_2026, AUTHOR = {Prahallad, Lavanya Sunkeswari and Mamidi, Radhika }, TITLE = {Prompt-Based LLM Analysis of Policy Agenda Evolution in U.S. Presidential Debates (1960–2024)}, BOOKTITLE = {International Conference on Federated Learning and Intelligent Computing Systems}, YEAR = {2026}}
Presidential debates provide a recurring, institutionally comparable record of how candidates prioritize public issues during election campaigns. This paper presents a longitudinal analysis of policy agendas in U.S. presidential and vice-presidential debates from 1960 to 2024. We analyze 48 debate transcripts using a hybrid framework that combines interpretable keyword-based topic identification with large language model (LLM)-assisted semantic topic extraction and sentence-level topic clustering. The analysis identifies durable and shifting components of debate agendas across six decades. Foreign policy and security are especially prominent in the early Cold War era, domestic economic governance remains consistently central across the entire period, and healthcare, immigration, climate, and institutional concerns become increasingly visible in later decades. In addition to topic prominence, we examine agenda diversity and transitions between dominant policy domains across election cycles. The results show that presidential debate agendas evolve gradually rather than abruptly, while becoming more topically diverse in recent decades. These findings demonstrate the value of LLM-assisted methods for historically grounded analysis of political discourse and for tracing long-term changes in policy emphasis in democratic elections.
EthiQuest: LLM-Powered Ethical Questionnaire Generation for Research Review
Ishank,Radhika Mamidi,Rahul Mishra
International Conference on Language Resources and Evaluation, LREC, 2026
@inproceedings{bib_Ethi_2026, AUTHOR = {Ishank and Mamidi, Radhika and Mishra, Rahul }, TITLE = {EthiQuest: LLM-Powered Ethical Questionnaire Generation for Research Review}, BOOKTITLE = {International Conference on Language Resources and Evaluation}, YEAR = {2026}}
Building upon the critical importance of ethical considerations in research, we introduce a novel task of Ethical
Questionnaire Generation (EQG) for research papers. Ethical review has become an indispensable component of
the research process, helping identify potential risks, biases, and societal impacts that may arise from scientific work.
In this paper, we present EthiQuest, a comprehensive dataset comprising 3663 research papers paired with their
corresponding ethical questionnaires extracted from major conference proceedings. We explore various approaches
leveraging large language models (LLMs) to automatically generate context-aware ethical questionnaires, examining
the unique challenges of capturing domain-specific ethical concerns, ensuring comprehensive coverage of potential
issues, and maintaining question relevance and clarity. Our experiments demonstrate the effectiveness of fine-tuned
LLMs in generating pertinent ethical questions across diverse research domains. We provide detailed analysis
of question quality, coverage metrics, and practical insights for deploying such systems in real-world research review processes. The EQG dataset and code can be accessed at https://github.com/Ishank-Kapania/eqg/.
Large Language Models Decide Early and Explain Later
Ayan Datta,Zhixue Zhao,Bhuvanesh Verma,Radhika Mamidi,Mounika Marreddy,Alexander Mehler
Technical Report, arXiv, 2026
@inproceedings{bib_Larg_2026, AUTHOR = {Datta, Ayan and Zhao, Zhixue and Verma, Bhuvanesh and Mamidi, Radhika and Marreddy, Mounika and Mehler, Alexander }, TITLE = {Large Language Models Decide Early and Explain Later}, BOOKTITLE = {Technical Report}, YEAR = {2026}}
Large Language Models often achieve strong performance by generating long intermediate chain-of-thought reasoning. However, it remains unclear when a model's final answer is actually determined during generation. If the answer is already fixed at an intermediate stage, subsequent reasoning tokens may constitute post-decision explanation, increasing inference cost and latency without improving correctness. We study the evolution of predicted answers over reasoning steps using forced answer completion, which elicits the model's intermediate predictions at partial reasoning prefixes. Focusing on Qwen3-4B and averaging results across all datasets considered, we find that predicted answers change in only 32% of queries. Moreover, once the final answer switch occurs, the model generates an average of 760 additional reasoning tokens per query, accounting for a substantial fraction of the total reasoning budget. Motivated by these findings, we investigate early stopping strategies that halt generation once the answer has stabilized. We show that simple heuristics, including probe-based stopping, can reduce reasoning token usage by 500 tokens per query while incurring only a 2% drop in accuracy. Together, our results indicate that a large portion of chain-of-thought generation is redundant and can be reduced with minimal impact on performance.
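The idea of halting generation once the answer has stabilized can be illustrated with a minimal sketch. This is not the stopping rule from the paper: the `forced_answer` callback, the `patience` threshold, and the list of reasoning prefixes are all hypothetical names introduced here, and the rule shown (stop after the same answer appears at several consecutive checkpoints) is only one simple stability heuristic.

```python
from typing import Callable, List, Optional, Tuple


def stop_when_stable(
    prefixes: List[str],
    forced_answer: Callable[[str], str],
    patience: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (answer, index of the prefix at which generation would stop).

    prefixes      -- partial reasoning traces, shortest first (hypothetical input)
    forced_answer -- hypothetical callback that elicits an answer from a prefix
    patience      -- assumed number of consecutive identical answers required
    """
    last_answer: Optional[str] = None
    streak = 0
    for i, prefix in enumerate(prefixes):
        answer = forced_answer(prefix)
        if answer == last_answer:
            streak += 1
        else:
            # The predicted answer switched, so the stability count restarts.
            last_answer, streak = answer, 1
        if streak >= patience:
            return last_answer, i
    # The answer never stabilized; fall back to the last checkpoint.
    return last_answer, len(prefixes) - 1


if __name__ == "__main__":
    # Toy stand-in for a model: the "answer" is the last word of the prefix.
    toy_prefixes = ["a 7", "a b 9", "a b c 9", "a b c d 9", "a b c d e 9"]
    print(stop_when_stable(toy_prefixes, lambda p: p.split()[-1], patience=3))
```

A larger `patience` trades more reasoning tokens for a lower risk of stopping before a late answer switch.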
From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks
Ayan Datta,Mounika Marreddy,Alexander Mehler,Zhixue Zhao,Radhika Mamidi
Technical Report, arXiv, 2026
@inproceedings{bib_From_2026, AUTHOR = {Datta, Ayan and Marreddy, Mounika and Mehler, Alexander and Zhao, Zhixue and Mamidi, Radhika }, TITLE = {From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks}, BOOKTITLE = {Technical Report}, YEAR = {2026}}
Large language models (LLMs) exhibit failures on elementary symbolic tasks such as character counting in a word, despite excelling on complex benchmarks. Although this limitation has been noted, the internal reasons remain unclear. We use character counting (e.g., "How many p's are in apple?") as a minimal, controlled probe that isolates token-level reasoning from higher-level confounds. Using this setting, we uncover a consistent phenomenon across modern architectures, including LLaMA, Qwen, and Gemma: models often compute the correct answer internally yet fail to express it at the output layer.
Through mechanistic analysis combining probing classifiers, activation patching, logit lens analysis, and attention head tracing, we show that character-level information is encoded in early and mid-layer representations. However, this information is attenuated by a small set of components in later layers, especially the penultimate and final layer MLP. We identify these components as negative circuits: subnetworks that downweight correct signals in favor of higher-probability but incorrect outputs.
Our results lead to two contributions. First, we show that symbolic reasoning failures in LLMs are not due to missing representations or insufficient scale, but arise from structured interference within the model's computation graph. This explains why such errors persist and can worsen under scaling and instruction tuning. Second, we provide evidence that LLM forward passes implement a form of competitive decoding, in which correct and incorrect hypotheses coexist and are dynamically reweighted, with final outputs determined by suppression as much as by amplification.
These findings carry implications for interpretability and robustness: simple symbolic reasoning exposes weaknesses in modern LLMs, underscoring the need for design strategies that ensure information is both encoded and reliably used.
Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection
Federica Gamba,Binesh Arakkal Remesh,Aryan Ashok Chandramania,Rohit Agarwa,Chuyuan Li,Ioana Buhnila,Radhika Mamidi,Aman Sinha,Timothee Mickus,Raúl Vázquez,Patanjali Bhamidipati,Claudio Savelli,Ahana Chattopadhyay,Laura Zanella,Yash Kankanampati
International Conference on Language Resources and Evaluation, LREC, 2026
@inproceedings{bib_Conf_2026, AUTHOR = {Gamba, Federica and Remesh, Binesh Arakkal and Chandramania, Aryan Ashok and Agarwa, Rohit and Li, Chuyuan and Buhnila, Ioana and Mamidi, Radhika and Sinha, Aman and Mickus, Timothee and Vázquez, Raúl and Bhamidipati, Patanjali and Savelli, Claudio and Chattopadhyay, Ahana and Zanella, Laura and Kankanampati, Yash }, TITLE = {Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection}, BOOKTITLE = {International Conference on Language Resources and Evaluation}, YEAR = {2026}}
We introduce the CAP (Confabulations from ACL Publications) dataset, a multilingual resource for studying
hallucinations in large language models (LLMs) within scientific text generation. CAP focuses on the scientific
domain, where hallucinations can distort factual knowledge, as they frequently do. In this domain, however, the
presence of specialized terminology, statistical reasoning, and context-dependent interpretations further exacerbates
these distortions, particularly given LLMs’ lack of true comprehension, limited contextual understanding, and bias
toward surface-level generalization. CAP operates in a cross-lingual setting covering five high-resource languages
(English, French, Hindi, Italian, and Spanish) and four low-resource languages (Bengali, Gujarati, Malayalam, and
Telugu). The dataset comprises 900 curated scientific questions and over 7,000 LLM-generated answers from
16 publicly available models, provided as question–answer pairs along with token sequences and corresponding
logits. Each instance is annotated with a binary label indicating the presence of a scientific hallucination, denoted as
a factuality error, and a fluency label, capturing issues in the linguistic quality or naturalness of the text. CAP is
publicly released to facilitate advanced research on hallucination detection, multilingual evaluation of LLMs, and the
development of more reliable scientific NLP systems.
TeluguEval: A Comprehensive Benchmark for Evaluating LLM Capabilities in Telugu
Revanth Kumar Gundam,Radhika Mamidi
Conference of the European Chapter of the Association for Computational Linguistics (EACL), EACL, 2026
@inproceedings{bib_Telu_2026, AUTHOR = {Gundam, Revanth Kumar and Mamidi, Radhika }, TITLE = {TeluguEval: A Comprehensive Benchmark for Evaluating LLM Capabilities in Telugu}, BOOKTITLE = {Conference of the European Chapter of the Association for Computational Linguistics (EACL)}, YEAR = {2026}}
Large Language Models (LLMs) excel on English reasoning tasks but falter on morphologically rich, low-resource languages such as Telugu, Tamil, and Kannada. We present TeluguEval, a human-curated reasoning benchmark created by translating GSM8K (math), Winogrande (commonsense), ARC (science), CaseHOLD (law), and Hendrycks Ethics into Telugu. We evaluate eight models spanning global (Llama-3.1-8B, Llama-2-7B, Qwen-8B, Gemma-7B, Gemini-2.0) and regional (Telugu-Llama2-7B, Indic-Gemma-7B, Sarvam-m-24B) systems. While extremely strong models such as Gemini and Sarvam-m largely retain performance in Telugu, most English-centric models suffer severe accuracy drops, often exceeding 30 to 40 points, particularly on mathematical and scientific reasoning. We further observe systematic failure modes including script sensitivity, option-selection bias, repetition loops, and unintended code-switching. Our results demonstrate that surface-level Telugu fluency does not imply robust reasoning capability, underscoring the need for Telugu-specific data, tokenization, and pretraining. TeluguEval provides a standardized testbed to drive progress on reasoning in low-resource Indian languages.
Quantifying Bias in Text Generative AI models
Devisetti Sai Asrith,Radhika Mamidi
Journal of Mathematics and Computer Science, IJMCS, 2025
@inproceedings{bib_Quan_2025, AUTHOR = {Asrith, Devisetti Sai and Mamidi, Radhika }, TITLE = {Quantifying Bias in Text Generative AI models}, BOOKTITLE = {Journal of Mathematics and Computer Science}, YEAR = {2025}}
Generative artificial intelligence (AI), especially large language models (LLMs), is increasingly
deployed in domains such as recruitment, content creation, and education. While these systems
accelerate productivity, they also risk reproducing and amplifying societal biases (Ahuchogu et
al., 2025). This project addresses the urgent challenge of identifying, quantifying, and mitigating
gender bias in text-generative AI outputs, with a focus on job narratives. Building on our
independent study of 11,000+ AI-generated job narratives, which we generated using Gemini AI,
we introduce a bias quantification framework using mean bias, mean absolute bias, sentiment
skew (via TextBlob), and distributional measures (via Kullback–Leibler divergence and related
distances). Preliminary results show measurable gendered patterns across generated narratives,
supporting the hypothesis of gender bias in LLM outputs.
The proposed work extends this foundation in three directions: expanding bias quantification
using probabilistic distribution distances (Devisetti, 2024; Chung et al., 1989), evaluating
prompt-construction bias and multi-model comparisons across GPT-3, GPT-4, Gemini, and
open-source LLMs (Blodgett et al., 2020), and integrating interpretable embedding methods
(e.g., SPINE) (Subramanian et al., 2017) for transparency in downstream debiasing.
The expected contribution is both theoretical and practical: a robust bias quantification pipeline
grounded in probability theory, and actionable strategies to mitigate bias in LLM-generated
recruitment texts (Ferrara, 2024). Beyond recruitment, the proposed methodology aims to serve
as a standard for bias evaluation in generative AI applications more broadly.
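The metrics named in this abstract (mean bias, mean absolute bias, Kullback–Leibler divergence) can be sketched as follows. This is an illustrative sketch only, not the authors' pipeline: it assumes each narrative has already been given a signed bias score (the sign convention is a hypothetical choice) and that word or category frequencies are available as normalized dictionaries.

```python
import math
from typing import Dict, List


def mean_bias(scores: List[float]) -> float:
    """Average signed score; opposite-signed scores cancel out."""
    return sum(scores) / len(scores)


def mean_absolute_bias(scores: List[float]) -> float:
    """Average magnitude of the scores, ignoring direction."""
    return sum(abs(s) for s in scores) / len(scores)


def kl_divergence(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats over the keys of p.

    p and q are assumed to be normalized probability distributions;
    eps is an assumed floor that guards against log(0) when q lacks a key.
    """
    return sum(
        pv * math.log(pv / max(q.get(k, 0.0), eps))
        for k, pv in p.items()
        if pv > 0
    )


if __name__ == "__main__":
    # Hypothetical signed scores for three generated narratives.
    scores = [0.5, -0.5, 1.0]
    print(mean_bias(scores), mean_absolute_bias(scores))
    # Hypothetical descriptor distributions for two groups of narratives.
    group_a = {"leader": 0.6, "caring": 0.4}
    group_b = {"leader": 0.3, "caring": 0.7}
    print(kl_divergence(group_a, group_b))
```

Reporting both means is useful because a mean bias near zero can hide large opposite-signed scores that mean absolute bias still exposes.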
IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?
Aravapalli Akhilesh,Mounika Marreddy,Radhika Mamidi,Manish Gupta,Subba Reddy Oota
International Joint Conference on Natural Language Processing - Findings, IJCNLP-F, 2025
@inproceedings{bib_Indi_2025, AUTHOR = {Akhilesh, Aravapalli and Marreddy, Mounika and Mamidi, Radhika and Gupta, Manish and Oota, Subba Reddy }, TITLE = {IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?}, BOOKTITLE = {International Joint Conference on Natural Language Processing - Findings}, YEAR = {2025}}
Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately 47K sentences. Our probing analysis of surface, syntactic, and semantic properties reveals that, while almost all multilingual models demonstrate consistent encoding performance for English, surprisingly, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages.
Emotion-Aware Dysarthric Speech Reconstruction: LLMs and Multimodal Evaluation with MCDS
Kaushal Attaluri,Radhika Mamidi
International Joint Conference on Natural Language Processing - Findings, IJCNLP-F, 2025
@inproceedings{bib_Emot_2025, AUTHOR = {Attaluri, Kaushal and Mamidi, Radhika }, TITLE = {Emotion-Aware Dysarthric Speech Reconstruction: LLMs and Multimodal Evaluation with MCDS}, BOOKTITLE = {International Joint Conference on Natural Language Processing - Findings}, YEAR = {2025}}
Over 46 million people worldwide suffer from dysarthria, a motor speech disorder caused by neurological conditions like stroke or Parkinson’s disease, making their speech slurred, unintelligible, and emotionally distorted. This severely affects communication, quality of life, and social inclusion. We present the first emotion-aware framework for dysarthric speech reconstruction, where the speaker’s emotion is detected from audio and used to guide large language models in recovering intelligible, emotionally faithful sentences. To evaluate this, we introduce a novel metric, the Multimodal Communication Dysarthria Score (MCDS), which holistically measures both linguistic and emotional accuracy. Our results show strong improvements over traditional baselines, offering a breakthrough toward emotionally intelligent assistive speech systems that prioritize both understanding and empathy.
Aligning Text/Speech Representations from Multimodal Models with MEG Brain Activity During Listening
Padakanti Srijith,Khushbu Pahwa,Radhika Mamidi,Bapi Raju Surampudi,Manish Gupta,Subba Reddy Oota
Conference on Empirical Methods in Natural Language Processing, EMNLP, 2025
@inproceedings{bib_Alig_2025, AUTHOR = {Srijith, Padakanti and Pahwa, Khushbu and Mamidi, Radhika and Surampudi, Bapi Raju and Gupta, Manish and Oota, Subba Reddy }, TITLE = {Aligning Text/Speech Representations from Multimodal Models with MEG Brain Activity During Listening}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}, YEAR = {2025}}
Although speech language models are expected to align well with brain language processing during speech comprehension, recent studies have found that they fail to capture brain-relevant semantics beyond low-level features. Surprisingly, text-based language models exhibit stronger alignment with brain language regions, as they better capture brain-relevant semantics. However, no prior work has examined the alignment effectiveness of text/speech representations from multimodal models. This raises several key questions: Can speech embeddings from such multimodal models capture brain-relevant semantics through cross-modal interactions? Which modality can take advantage of this synergistic multimodal understanding to improve alignment with brain language processing? Can text/speech representations from such multimodal models outperform unimodal models? To address these questions, we systematically analyze multiple multimodal models, extracting both text- and speech-based representations to assess their alignment with MEG brain recordings during naturalistic story listening. We find that text embeddings from both multimodal and unimodal models significantly outperform speech embeddings from these models. Specifically, multimodal text embeddings exhibit a peak around 200 ms, suggesting that they benefit from speech embeddings, with heightened activity during this time period. However, speech embeddings from these multimodal models still show a similar alignment compared to their unimodal counterparts, suggesting that they do not gain meaningful semantic benefits over text-based representations. These results highlight an asymmetry in cross-modal knowledge transfer, where the text modality benefits more from speech information, but not vice versa. We make the code publicly available.
Analyzing Biases in Political Dialogue: Tagging U.S. Presidential Debates with an Extended DAMSL Framework
Lavanya Sunkeswari Prahallad,Radhika Mamidi
Technical Report, arXiv, 2025
@inproceedings{bib_Anal_2025, AUTHOR = {Prahallad, Lavanya Sunkeswari and Mamidi, Radhika }, TITLE = {Analyzing Biases in Political Dialogue: Tagging U.S. Presidential Debates with an Extended DAMSL Framework}, BOOKTITLE = {Technical Report}, YEAR = {2025}}
We present a critical discourse analysis of the 2024 U.S. presidential debates, examining Donald Trump's rhetorical strategies in his interactions with Joe Biden and Kamala Harris. We introduce a novel annotation framework, BEADS (Bias Enriched Annotation for Dialogue Structure), which systematically extends the DAMSL framework to capture bias-driven and adversarial discourse features in political communication. BEADS includes a domain- and language-agnostic set of tags that model ideological framing, emotional appeals, and confrontational tactics. Our methodology compares detailed human annotation with zero-shot ChatGPT-assisted tagging on verified transcripts from the Trump and Biden (19,219 words) and Trump and Harris (18,123 words) debates. Our analysis shows that Trump consistently dominated in key categories: Challenge and Adversarial Exchanges, Selective Emphasis, Appeal to Fear, Political Bias, and Perceived Dismissiveness. These findings underscore his use of emotionally charged and adversarial rhetoric to control the narrative and influence audience perception. In this work, we establish BEADS as a scalable and reproducible framework for critical discourse analysis across languages, domains, and political contexts.
Voices of Dissent: A Multimodal Analysis of Protest Songs through Lyrics and Audio
Utsav Shekhar,Radhika Mamidi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2025
@inproceedings{bib_Voic_2025, AUTHOR = {Shekhar, Utsav and Mamidi, Radhika }, TITLE = {Voices of Dissent: A Multimodal Analysis of Protest Songs through Lyrics and Audio}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}, YEAR = {2025}}
Music has long served as a vehicle for political expression, with protest songs playing a central role in articulating dissent and mobilizing collective action. Yet, despite their cultural significance, the linguistic and acoustic signatures that define protest music remain understudied. We present a multimodal computational analysis of protest and non-protest songs spanning multiple decades. Using NLP and audio analysis, we identify the linguistic and musical features that differentiate protest songs. Instead of focusing on classification performance, we treat classification as a diagnostic tool to investigate these features and reveal broader patterns. Protest songs are not just politically charged; they are acoustically and linguistically distinct, and we quantify how.
The Evolution of Gen Alpha Slang: Linguistic Patterns and AI Translation Challenges
Ishita,Radhika Mamidi
Association for Computational Linguistics: Student Research Workshop, ACL - W, 2025
@inproceedings{bib_The__2025, AUTHOR = {Ishita and Mamidi, Radhika }, TITLE = {The Evolution of Gen Alpha Slang: Linguistic Patterns and AI Translation Challenges}, BOOKTITLE = {Association for Computational Linguistics: Student Research Workshop}, YEAR = {2025}}
Generation Alpha (born 2010-2024) is the first generation fully raised within the digital ecosystem. They exhibit unique linguistic behaviours influenced by rampant online communication and platform-specific cultures. This study examines the rapid evolution of Gen Alpha slang through a comparative analysis of Millennial and Gen Z vernacular. We identify three core linguistic patterns: extreme lexical compression, digital culture-driven semantic shifts and part-of-speech conversion. We construct a comprehensive slang corpus sourced from online platforms and evaluate the performance of four AI translation systems (viz. Google Translate, ChatGPT 4, Gemini 1.0, DeepSeek v3) on over 100 slang terms. Our results reveal significant translation challenges rooted in culturally-bound terms from gaming, meme culture, and mental health discourse. Most errors are the result of inadequate cultural contextualization, with literal translations dominating the error patterns. Our findings highlight the critical limitations in current language models and emphasize the need for adaptive, culturally sensitive and context-aware frameworks that can handle the dynamic lexicon of evolving youth vernacular.
Zero at SemEval-2025 Task 11: Multilingual Emotion Classification with BERT Variants: A Comparative Study
Revanth Kumar Gundam,Marri Abhinav,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2025
@inproceedings{bib_Zero_2025, AUTHOR = {Gundam, Revanth Kumar and Abhinav, Marri and Mamidi, Radhika }, TITLE = {Zero at SemEval-2025 Task 11: Multilingual Emotion Classification with BERT Variants: A Comparative Study}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2025}}
Emotion detection in text plays a crucial role in NLP applications such as sentiment analysis and feedback analysis. This study tackles two tasks: multi-label emotion detection, where the goal is to classify text based on six emotions (joy, sadness, fear, anger, surprise, and disgust) in a multilingual setting, and emotion intensity prediction, which assigns an ordinal intensity score to each of these emotions. Using the BRIGHTER dataset, a multilingual corpus spanning 28 languages, the paper addresses issues like class imbalance by treating each emotion as an independent binary classification problem. The paper first explores static embeddings such as GloVe with logistic regression classifiers on top. To capture contextual nuances more effectively, we fine-tune transformer-based models such as BERT and RoBERTa. Our approach demonstrates the benefits of fine-tuning for improved emotion prediction, while also highlighting the challenges of multilingual and multi-label classification.
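Treating each emotion as an independent binary classification problem amounts to splitting one multi-label target into one 0/1 target per emotion. The sketch below illustrates only that label transformation, under assumptions: the function name and the input layout (a list of emotion-name lists per text) are hypothetical and are not taken from the paper or from the BRIGHTER data format.

```python
from typing import Dict, List, Sequence

# The six emotions listed in the abstract.
EMOTIONS = ("joy", "sadness", "fear", "anger", "surprise", "disgust")


def to_binary_tasks(labels: Sequence[Sequence[str]]) -> Dict[str, List[int]]:
    """Split multi-label annotations into one binary target vector per emotion.

    labels[i] holds the emotions annotated for text i (assumed layout).
    The result maps each emotion to a 0/1 list aligned with the input order,
    so a separate binary classifier can be trained on each list.
    """
    return {
        emotion: [1 if emotion in sample else 0 for sample in labels]
        for emotion in EMOTIONS
    }


if __name__ == "__main__":
    annotations = [["joy", "surprise"], [], ["anger"]]
    for emotion, targets in to_binary_tasks(annotations).items():
        print(emotion, targets)
```

Each per-emotion target can then be balanced or weighted separately, which is one way such a split helps with class imbalance.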
Zero at SemEval-2025 Task 2: Entity-Aware Machine Translation: Fine-Tuning NLLB for Improved Named Entity Translation
Revanth Kumar Gundam,Marri Abhinav,Malladi Bhaskararama Sahishna Advaith,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2025
@inproceedings{bib_Zero_2025, AUTHOR = {Gundam, Revanth Kumar and Abhinav, Marri and Advaith, Malladi Bhaskararama Sahishna and Mamidi, Radhika }, TITLE = {Zero at SemEval-2025 Task 2: Entity-Aware Machine Translation: Fine-Tuning NLLB for Improved Named Entity Translation}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2025}}
Machine Translation (MT) is an essential tool for communication among people across different cultures, yet Named Entity (NE) translation remains a major challenge because named entities are rare and ambiguous. Traditional approaches, such as lexicons or parallel corpora, often fail to generalize to unseen entities and hence do not perform well. To address this, we create a silver dataset using the Google Translate API and fine-tune the facebook/nllb-200-distilled-600M model with LoRA (Low-Rank Adaptation) to enhance translation accuracy while keeping memory use efficient. Evaluated with metrics such as BLEU, COMET, and M-ETA, our results show that fine-tuning a specialized MT model improves NE translation without having to rely on large-scale general-purpose models.
Choose Your Words Wisely: Domain-adaptive Masking Makes Language Models Learn Faster
Vanshpreet Singh Kohli,Aaron Anthony Monis,Radhika Mamidi
Workshop on Representation Learning for NLP, RepL4NLP, 2025
@inproceedings{bib_Choo_2025, AUTHOR = {Kohli, Vanshpreet Singh and Monis, Aaron Anthony and Mamidi, Radhika }, TITLE = {Choose Your Words Wisely: Domain-adaptive Masking Makes Language Models Learn Faster}, BOOKTITLE = {Workshop on Representation Learning for NLP}, YEAR = {2025}}
Foundational Language Models perform significantly better on downstream tasks in specialised domains (such as law, computer science, and medical science) upon being further pre-trained on extensive domain-specific corpora, but this continual pre-training incurs heavy computational costs. Indeed, some of the most performant specialised language models such as BioBERT incur even higher computing costs during domain-specific training than the pre-training cost of the foundational models they are initialised from. In this paper, we argue that much of the extended pre-training is redundant, with models seemingly wasting valuable resources re-learning lexical and semantic patterns already well-represented in their foundational models such as BERT, T5 and GPT. Focusing on Masked Language Models, we introduce a novel domain-specific masking strategy that is designed to facilitate continual learning while minimizing the training cost. Using this approach, we train and present a BERT-based model trained on a biomedical corpus that matches or surpasses traditionally trained biomedical language models in performance across several downstream classification tasks while incurring up to 11 times lower training costs.
Choose Your Words Wisely: Domain-adaptive Masking Makes Language Models Learn Faster
Vanshpreet Singh Kohli,Aaron Anthony Monis,Radhika Mamidi
International Conference on Artificial Intelligence in Medicine, AIME, 2025
@inproceedings{bib_Choo_2025, AUTHOR = {Kohli, Vanshpreet Singh and Monis, Aaron Anthony and Mamidi, Radhika }, TITLE = {Choose Your Words Wisely: Domain-adaptive Masking Makes Language Models Learn Faster}, BOOKTITLE = {International Conference on Artificial Intelligence in Medicine}, YEAR = {2025}}
Foundational Language Models perform significantly better on downstream tasks in the biomedical domain upon being further pre-trained on extensive biomedical corpora, but this continual pre-training incurs heavy computational costs. Indeed, some of the most performant biomedical language models incur even more computing costs during domain-specific training than the entire training cost of the foundational models they are warm-started from. In this paper, we argue that much of the extended pre-training is redundant, with models seemingly wasting valuable resources re-learning lexical and semantic patterns already well-represented in their foundational models such as BERT, T5 and GPT. Focusing on Masked Language Models, we introduce a novel domain-specific masking strategy that is designed to facilitate continual learning while minimizing the training cost. Using this approach, we train and present a BERT-based model that matches or surpasses traditionally trained biomedical language models in performance across several downstream classification tasks while incurring up to 11 times lower training costs.
Implicature Benchmark for Hindi
Kaveri Anuranjana,Amit Shukla,Srihitha Mallepally,Mareddy Sri Harshitha,Radhika Mamidi
Linguistic Data Consortium for Indian Languages, LDC-IL, 2025
@inproceedings{bib_Impl_2025, AUTHOR = {Anuranjana, Kaveri and Shukla, Amit and Mallepally, Srihitha and Harshitha, Mareddy Sri and Mamidi, Radhika }, TITLE = {Implicature Benchmark for Hindi}, BOOKTITLE = {Linguistic Data Consortium for Indian Languages}, YEAR = {2025}}
According to the cooperative principle (Grice 1975), utterances in a conversation adhere to the four conversational maxims – quantity, relation, quality and manner. An implicature arises when a maxim is flouted or violated, and the listener is forced to infer a meaning. Consider the example:
Speaker A: क्या दिलीप को परीक्षाएं आसान लग रही हैं?
[Kya Dilip ko parikshayein aasan lag rahi hain?]
(Does Dilip find the exams easy?)
Speaker B: आज-कल वो सो नहीं पा रहा।
[Aaj-kal wo so nahi pa raha.]
(He hasn’t been able to sleep lately.)
Speaker B flouts the maxim of relevance by not answering the question with a yes or no. Instead, his reply that Dilip has trouble sleeping implies that he doesn’t find the exams to be easy.
While computational approaches have made some progress on English implicature benchmarks such as GRICE (Zheng, 2021), IMPRESS (Jeretic 2020) and the Conversational Implicature Benchmark (CIB) (George and Mamidi, 2020), there are no benchmarks for Hindi. Hence, we propose a Hindi implicature benchmark, the Indirect Questions Hi Implicature Benchmark.
Toledo-Ronen (2020) found that translation-based methods lose nuances in argument evaluation. The XCOPA benchmark (Ponti 2020) and Huang (2024) demonstrated that translated data hurts model performance for structurally divergent non-English languages. Hindi is a free word order, morphologically rich language. It is structurally different from English; hence, translating English implicature benchmarks may not be suitable.
Wang (2024) present a benchmark of indirect answers to yes-no questions in dialogues. Similar to their approach, we plan to collect Hindi interviews and extract indirect questions along with their answers involving implicature. We will annotate 5000 such questions. As a constraint, the preceding dialogue will be added to each question as background, to simplify inference over a single dialogue turn. Evaluation will rely on ROUGE scores to measure correctness, as we leave exploration of suitable metrics for future work. For model evaluation, we plan to leverage a) English LLMs (meta-llama/Llama-3.3-70B (Grattafiori 2024) and the current state-of-the-art reported by Sravanthi (2024), FlanT5-XXL (Chung 2022)) finetuned on Hindi-translated GRICE (Zheng 2021), IMPRESS (Jeretic 2020) and CIB (George and Mamidi 2020), and b) Hindi LLMs (Llama-3-Nanda-10B-Chat (Choudhury 2024) and Airavata (Gala 2024)).
Sandarśana: A Survey on Sanskrit Computational Linguistics and Digital Infrastructure for Sanskrit
Anagha Pradeep,Radhika Mamidi
ACM Computing Surveys, ACM-CS, 2025
@inproceedings{bib_Sand_2025, AUTHOR = {Pradeep, Anagha and Mamidi, Radhika }, TITLE = {Sandarśana: A Survey on Sanskrit Computational Linguistics and Digital Infrastructure for Sanskrit}, BOOKTITLE = {ACM Computing Surveys}, YEAR = {2025}}
Computational Linguistics is an interdisciplinary field of computer science and linguistics that focuses on designing computational models and algorithms for processing, analyzing, and generating human language. Over recent years, this field has made substantial progress. While its primary emphasis tends to center around widely spoken languages, there is equal importance in investigating languages that are not commonly spoken but have contributed immensely to the literature, culture, and philosophy of the society. Thus, this survey article comprehensively delves into the exploration of computational tasks undertaken for Sanskrit, an ancient language of the Indian sub-continent steeped in a wealth of literary heritage. The purpose of this study is to provide an overview of the progress made thus far in the computational analysis of Sanskrit, while also reviewing the current digital infrastructure that supports these efforts. Additionally, our study also identifies potential avenues for future research, serving as a reference for anyone interested in advancing their exploration in this field.
adithjrajeev at SemEval-2025 Task 10: Sequential Learning for Role Classification Using Entity-Centric News Summaries
Adith John Rajeev,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2025
@inproceedings{bib_adit_2025, AUTHOR = {Rajeev, Adith John and Mamidi, Radhika }, TITLE = {adithjrajeev at SemEval-2025 Task 10: Sequential Learning for Role Classification Using Entity-Centric News Summaries}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2025}}
There is a high prevalence of disinformation and manipulative narratives in online news sources today, and verifying their informative integrity is a vital need, as the online audience is highly susceptible to being affected by such propaganda or disinformation. The task of verifying any online information is, however, a significant challenge. The task Multilingual Characterization and Extraction of Narratives from Online News therefore focuses on developing novel methods of analyzing news ecosystems and detecting manipulation attempts to address this challenge. As a part of this effort, we focus on the subtask of Entity Framing, which involves assigning named entities in news articles one of three main roles (Protagonist, Antagonist, and Innocent) with a further fine-grained role distinction. We propose a pipeline that involves summarizing the article with the summary being centered around the entity. The entity and its entity-centric summary are then used as input for a BERT-based classifier to carry out the final role classification. Finally, we experiment with different approaches in the steps of the pipeline and compare the results obtained by them.
Survey on Computational Approaches to Implicature
Kaveri Anuranjana,Mareddy Sri Harshitha,Srihitha Mallepally,Amit Shukla,Radhika Mamidi
International Conference on Natural Language Processing, ICON, 2025
@inproceedings{bib_Surv_2025, AUTHOR = {Anuranjana, Kaveri and Harshitha, Mareddy Sri and Mallepally, Srihitha and Shukla, Amit and Mamidi, Radhika }, TITLE = {Survey on Computational Approaches to Implicature}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2025}}
This paper explores the concept of solving implicature in Natural Language Processing (NLP), highlighting its significance in understanding indirect communication. Drawing on foundational theories by Austin, Searle, and Grice, we discuss how implicature extends beyond literal language to convey nuanced meanings. We review existing datasets, including the Pragmatic Understanding Benchmark (PUB), that assess models’ capabilities in recognizing and interpreting implicatures. Despite recent advances in large language models (LLMs), challenges remain in effectively processing implicature due to limitations in training data and the complexities of contextual interpretation. We propose future directions for research, including the enhancement of datasets and the integration of pragmatic reasoning tasks, to improve LLMs’ understanding of implicature and facilitate better human-computer interaction.
Significance of Chain of Thought in Gender Bias Mitigation for English-Dravidian Machine Translation
Lavanya Sunkeswari Prahallad,Radhika Mamidi
International Conference on the AI Revolution: Research, Ethics, and Society, AIR-RES, 2024
@inproceedings{bib_Sign_2024, AUTHOR = {Prahallad, Lavanya Sunkeswari and Mamidi, Radhika }, TITLE = {Significance of Chain of Thought in Gender Bias Mitigation for English-Dravidian Machine Translation}, BOOKTITLE = {International Conference on the AI Revolution: Research, Ethics, and Society}, YEAR = {2024}}
Gender bias in machine translation (MT) systems poses a significant challenge to achieving accurate and inclusive translations. This paper examines gender bias in machine translation systems for languages such as Telugu and Kannada from the Dravidian family, analyzing how gender inflections affect translation accuracy and neutrality using Google Translate and ChatGPT. It finds that while plural forms can reduce bias, individual-centric sentences often maintain the bias due to historical stereotypes. The study evaluates the Chain of Thought processing, noting significant bias mitigation from 80% to 4% in Telugu and from 40% to 0% in Kannada. It also compares Telugu and Kannada translations, emphasizing the need for language-specific strategies to address these challenges and suggesting directions for future research to enhance fairness in both data preparation and prompts during inference.
Towards Enhancing Knowledge Accessibility for Low-Resource Indian Languages: A Template Based Approach
Chelpuri Abhijith,Aravapalli Akhilesh,Padakanti Srijith,Radhika Mamidi
International Conference on Natural Language Processing, ICON, 2024
@inproceedings{bib_Towa_2024, AUTHOR = {Abhijith, Chelpuri and Akhilesh, Aravapalli and Srijith, Padakanti and Mamidi, Radhika }, TITLE = {Towards Enhancing Knowledge Accessibility for Low-Resource Indian Languages: A Template Based Approach}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2024}}
In today’s digital age, access to knowledge and information is crucial for societal growth. Although widespread resources like Wikipedia exist, there is still a linguistic barrier to break down for low-resource languages. In India, millions of individuals still lack access to reliable information from Wikipedia because they are only proficient in their regional language. To address this gap, our work focuses on enhancing the content and digital footprint of multiple Indian languages. The primary objective of our work is to improve knowledge accessibility by generating a substantial volume of high-quality Wikipedia articles in Telugu, a widely spoken language in India with around 95.7 million native speakers. Our work aims to create Wikipedia articles and also ensures that each article meets necessary quality standards such as a minimum word count, inclusion of images for reference, and an infobox. Our work also adheres to the five core principles of Wikipedia. We streamline our article generation process, leveraging NLP techniques such as translation, transliteration, and template generation and incorporating human intervention when necessary. Our contribution is a collection of 8,929 articles in the movie domain, now ready to be published on Telugu Wikipedia.
Towards Efficient Audio-Text Keyword Spotting: Quantization and Multi-Scale Linear Attention with Foundation Models
Rahothvarman P,Radhika Mamidi
International Conference on Natural Language Processing, ICON, 2024
@inproceedings{bib_Towa_2024, AUTHOR = {P, Rahothvarman and Mamidi, Radhika }, TITLE = {Towards Efficient Audio-Text Keyword Spotting: Quantization and Multi-Scale Linear Attention with Foundation Models}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2024}}
Open Vocabulary Keyword Spotting is essential in numerous applications, from virtual assistants to security systems, as it allows systems to identify specific words or phrases in continuous speech. In this paper, we propose a novel end-to-end method for detecting user-defined open vocabulary keywords by leveraging linguistic patterns for the correlation between audio and text modalities. Our approach utilizes quantized pre-trained foundation models for robust audio embeddings and a unique lightweight Multi-Scale Linear Attention (MSLA) network that aligns speech and text representations for effective cross-modal agreement. We evaluate our method on two distinct datasets, comparing its performance against other baselines. The results highlight the effectiveness of our approach, achieving significant improvements over the Cross-Modality Correspondence Detector (CMCD) method, with a 16.08% increase in AUC and a 17.2% reduction in EER metrics on the Google Speech Commands dataset. These findings demonstrate the potential of our method to advance keyword spotting across various real-world applications.
Bridge the GAP: Multi-lingual Models For Ambiguous Pronominal Coreference Resolution in South Asian Languages
Adith John Rajeev,Rahothvarman P,Kaveri Anuranjana,Radhika Mamidi
Workshop on Challenges in Processing South Asian Languages, CHiPSAL-W, 2024
@inproceedings{bib_Brid_2024, AUTHOR = {Rajeev, Adith John and P, Rahothvarman and Anuranjana, Kaveri and Mamidi, Radhika }, TITLE = {Bridge the GAP: Multi-lingual Models For Ambiguous Pronominal Coreference Resolution in South Asian Languages}, BOOKTITLE = {Workshop on Challenges in Processing South Asian Languages}, YEAR = {2024}}
Coreference resolution, the process of determining what a referring expression (a pronoun or a noun phrase) refers to in discourse, is a critical aspect of natural language understanding. However, the development of computational models for coreference resolution in low-resource languages, such as the Dravidian (and more broadly all South Asian) languages, still remains a significant challenge due to the scarcity of annotated corpora in these languages. To address this data scarcity, we adopt a pipeline that translates the English GAP dataset into various South Asian languages, creating a multi-lingual coreference dataset mGAP. Our research aims to leverage this dataset and develop two novel models, namely the joint embedding model and the cross attention model for coreference resolution with Dravidian languages in mind. We also demonstrate that cross-attention captures pronoun-candidate relations better, leading to improved coreference resolution. We also harness the similarity across South Asian languages via transfer learning in order to use high resource languages to learn coreference for low resource languages.
Maha Bhaashya at SemEval-2024 Task 6: Zero-Shot Multi-task Hallucination Detection
Malladi Bhaskararama Sahishna Advaith,Patanjali Bhamidipati,Manish Shrivastava,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2024
@inproceedings{bib_Maha_2024, AUTHOR = {Advaith, Malladi Bhaskararama Sahishna and Bhamidipati, Patanjali and Shrivastava, Manish and Mamidi, Radhika }, TITLE = {Maha Bhaashya at SemEval-2024 Task 6: Zero-Shot Multi-task Hallucination Detection}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2024}}
In recent studies, the extensive utilization of large language models has underscored the importance of robust evaluation methodologies for assessing text generation quality and relevance to specific tasks. This has revealed a prevalent issue known as hallucination, an emergent condition in the model where generated text lacks faithfulness to the source and deviates from the evaluation criteria. In this study, we formally define hallucination and propose a framework for its quantitative detection in a zero-shot setting, leveraging our definition and the assumption that model outputs entail task and sample specific inputs. In detecting hallucinations, our solution achieves an accuracy of 0.78 in a model-aware setting and 0.61 in a model-agnostic setting. Notably, our solution maintains computational efficiency, requiring far less computational resources than other SOTA approaches, aligning with the trend towards lightweight and compressed models.
Context and WSD: Analysing Google Translate’s Sanskrit to English Output of Bhagavadgītā Verses for Word Meaning
Anagha Pradeep,Radhika Mamidi,Pavankumar Satuluri
International Sanskrit Computational Linguistics Symposium, ISCLS, 2024
@inproceedings{bib_Cont_2024, AUTHOR = {Pradeep, Anagha and Mamidi, Radhika and Satuluri, Pavankumar }, TITLE = {Context and WSD: Analysing Google Translate’s Sanskrit to English Output of Bhagavadgītā Verses for Word Meaning}, BOOKTITLE = {International Sanskrit Computational Linguistics Symposium}, YEAR = {2024}}
In addition to innate human intelligence, having access to extensive context and world knowledge is a crucial factor that aids in comprehending natural language, making it smooth and effortless to understand words with multiple meanings for humans. Although machines lack intrinsic intelligence, their capacity to learn language can greatly improve with access to more data, which serves as valuable context. In Natural Language Processing (NLP), the task of identifying and attributing the right sense of a word in a given context is called Word Sense Disambiguation (WSD). WSD, as a sub-task, plays a crucial role in several NLP applications such as Machine Translation. Every language has a set of words that have multiple senses. Sanskrit, one of the ancient and classical languages of the Indian subcontinent is no exception to this. Like many other languages with a rich literary tradition, Sanskrit features a multitude of polysemous words. However, it is essential to acknowledge that the data used to train machine models on Sanskrit is considerably less compared to European and a few other Indian languages. Consequently, the task of disambiguating word senses in Sanskrit presents a highly complex challenge for machines, especially when considering the unique and rich nature of its literary language. The purpose of this paper is to delineate the potential areas where the infusion of additional data can enhance language learning, through a manual error analysis taxonomy focused on the Bhagavadgītā. Our analysis will delve into the translation outcomes produced by Google Translate, which is considered the state-of-the-art tool for handling Sanskrit and other languages with limited available resources.
SemEval-2024 Task 8: Weighted Layer Averaging RoBERTa for Black-Box Machine-Generated Text Detection
Ayan Datta,Aryan Ashok Chandramania,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2024
@inproceedings{bib_SemE_2024, AUTHOR = {Datta, Ayan and Chandramania, Aryan Ashok and Mamidi, Radhika }, TITLE = {SemEval-2024 Task 8: Weighted Layer Averaging RoBERTa for Black-Box Machine-Generated Text Detection}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2024}}
This document contains the details of the authors' submission to the proceedings of SemEval 2024's Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection Subtask A (monolingual) and B. Detection of machine-generated text is becoming an increasingly important task, with the advent of large language models (LLMs). In this paper, we lay out how using weighted averages of RoBERTa layers lets us capture information about text that is relevant to machine-generated text detection.
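The idea of a weighted average over encoder layers can be illustrated with a minimal sketch. The abstract does not say how the weights are obtained or normalized, so the softmax normalization, the function names, and the toy vectors below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Turn arbitrary per-layer scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_average(layer_vectors, layer_scores):
    """Combine one vector per layer into a single vector.

    layer_vectors: list of equal-length vectors, one per encoder layer
                   (hypothetical stand-ins for RoBERTa hidden states).
    layer_scores:  one scalar score per layer (assumed to be learnable).
    """
    if len(layer_vectors) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
        for i in range(dim)
    ]


# Toy example: three "layers" with 2-dimensional vectors and equal scores,
# so every layer receives the same weight.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = weighted_layer_average(layers, [0.0, 0.0, 0.0])
```

In a real system the per-layer scores would typically be trained jointly with the classifier, so that layers carrying more task-relevant signal receive larger weights.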
Mast Kalandar at SemEval-2024 Task 8: On the Trail of Textual Origins: RoBERTa-BiLSTM Approach to Detect AI-Generated Text
Bafna Jainit Sushil,Hardik Mittal,Suyash Sethia,Manish Shrivastava,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2024
@inproceedings{bib_Mast_2024, AUTHOR = {Sushil, Bafna Jainit and Mittal, Hardik and Sethia, Suyash and Shrivastava, Manish and Mamidi, Radhika }, TITLE = {Mast Kalandar at SemEval-2024 Task 8: On the Trail of Textual Origins: RoBERTa-BiLSTM Approach to Detect AI-Generated Text}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2024}}
Large Language Models (LLMs) have showcased impressive abilities in generating fluent responses to diverse user queries. However, concerns regarding the potential misuse of such texts in journalism, educational, and academic contexts have surfaced. SemEval 2024 introduces the task of Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection, aiming to develop automated systems for identifying machine-generated text and detecting potential misuse. In this paper, we i) propose a RoBERTa-BiLSTM based classifier designed to classify text into two categories: AI-generated or human, and ii) conduct a comparative study of our model with baseline approaches to evaluate its effectiveness. This paper contributes to the advancement of automatic text detection systems in addressing the challenges posed by machine-generated text misuse. In the official leaderboard for the task, our architecture was ranked 46th, achieving an accuracy of 0.8083.
Transformer-based Context Aware Morphological Analyzer for Telugu
Chelpuri Abhijith,Dasari Priyanka,Nagaraju Vuppala,Mounika Marreddy,Radhika Mamidi,Parameswari Krishnamurthy
Workshop on Speech and Language Technologies for Dravidian Languages, DravidianLangTech, 2023
@inproceedings{bib_Tran_2023, AUTHOR = {Abhijith, Chelpuri and Priyanka, Dasari and Vuppala, Nagaraju and Marreddy, Mounika and Mamidi, Radhika and Krishnamurthy, Parameswari }, TITLE = {Transformer-based Context Aware Morphological Analyzer for Telugu}, BOOKTITLE = {Workshop on Speech and Language Technologies for Dravidian Languages}, YEAR = {2023}}
This paper addresses the challenges faced by Indian languages in leveraging deep learning for natural language processing (NLP) due to limited resources, annotated datasets, and Transformer-based architectures. We specifically focus on Telugu and aim to construct a Telugu morph analyzer dataset comprising 10,000 sentences. Furthermore, we assess the performance of established multilingual Transformer models (m-Bert, XLMR, IndicBERT) and the mono-lingual Transformer model BERT-Te (trained from scratch on an extensive Telugu corpus comprising 80,15,588 sentences). Our findings demonstrate that Transformer-based representations pre-trained on Telugu data improve the performance of the Telugu morph analyzer, surpassing existing multi-lingual approaches. This highlights the necessity of developing dedicated corpora, annotated datasets, and machine learning models in a mono-lingual setting. Using our dataset, we present benchmark results for the Telugu morph analyzer achieved through simple fine-tuning on BERT-Te. The morph analyzer dataset and code are open-sourced.
DAP-LeR-DAug: Techniques for enhanced Online Sexism Detection
Jayant Panwar,Radhika Mamidi
International Conference on Natural Language and Speech Processing, ICNLSP, 2023
@inproceedings{bib_DAP-_2023, AUTHOR = {Panwar, Jayant and Mamidi, Radhika }, TITLE = {DAP-LeR-DAug: Techniques for enhanced Online Sexism Detection}, BOOKTITLE = {International Conference on Natural Language and Speech Processing}, YEAR = {2023}}
The swift surge of digital communication on social media platforms has brought about an increase in hate speech online, especially sexism. Such content can have devastating effects on the psychological well-being of the users, and it becomes imperative to design automated systems that can identify and flag such harmful content. Human moderation alone is inadequate to manage the volume of content, necessitating efficient technological solutions. In this study, we explore the performance of different modern techniques on BERT-based models for detecting sexist text. We explore four such techniques, namely, Domain Adaptive Pre-training (DAP), Learning Rate Scheduling (LeR), Data Augmentation (DAug), and an ensemble of all three. The results show that each technique improves performance differently on each task due to their different approaches, which may be suited to a certain problem more. The ensemble model performs the best in all three subtasks. These models are trained on a SemEval’23 shared task dataset, which includes both sexist and non-sexist texts. All in all, this study explores the potential of DAP-LeR-DAug techniques in detecting sexist content. The results of this study highlight the strengths and weaknesses of the three different techniques with respect to each subtask. The results of this study will be useful for researchers and developers interested in developing systems for identifying and flagging online hate speech.
Enhancing Code-mixed Text Generation Using Synthetic Data Filtering in Neural Machine Translation
Dama Sravani,Radhika Mamidi
The SIGNLL Conference on Computational Natural Language Learning, CoNLL, 2023
@inproceedings{bib_Enha_2023, AUTHOR = {Sravani, Dama and Mamidi, Radhika }, TITLE = {Enhancing Code-mixed Text Generation Using Synthetic Data Filtering in Neural Machine Translation}, BOOKTITLE = {The SIGNLL Conference on Computational Natural Language Learning}, YEAR = {2023}}
CoPara: The First Dravidian Paragraph-level n-way Aligned Corpus
E Nikhil,Mukund Choudhary,Radhika Mamidi
Workshop on Speech and Language Technologies for Dravidian Languages, DravidianLangTech, 2023
@inproceedings{bib_CoPa_2023, AUTHOR = {Nikhil, E and Choudhary, Mukund and Mamidi, Radhika }, TITLE = {CoPara: The First Dravidian Paragraph-level n-way Aligned Corpus}, BOOKTITLE = {Workshop on Speech and Language Technologies for Dravidian Languages}, YEAR = {2023}}
We present CoPara, the first publicly available paragraph-level (n-way aligned) multilingual parallel corpus for Dravidian languages. The collection contains 2856 paragraph/passage pairs between English and four Dravidian languages. We source the parallel paragraphs from the New India Samachar magazine and align them with English as a pivot language. We do human and artificial evaluations to validate the high-quality alignment and richness of the parallel paragraphs of a range of lengths. To show one of the many ways this dataset can be wielded, we finetuned IndicBART, an S2S transformer model, on NMT for all XX-En pairs of languages in CoPara; it performs better than existing sentence-level models on standard benchmarks (like BLEU) on sentence-level translations and on longer text too. We show how this dataset can enrich a model trained for a task like this, with more contextual cues and beyond sentence understanding even in low-resource settings like that of Dravidian languages. Finally, the dataset and models are made available publicly at GitHub to help advance research in Dravidian NLP, parallel multilingual, and beyond sentence-level tasks like NMT, etc.
Multilingual Bias Detection and Mitigation for Low Resource Languages
Anubhav Sharma,Ankita Maity,Tushar Abhishek,Rudra Dhar,Radhika Mamidi,Manish Gupta,Vasudeva Varma Kalidindi
Wiki Workshop, Wiki-W, 2023
@inproceedings{bib_Mult_2023, AUTHOR = {Sharma, Anubhav and Maity, Ankita and Abhishek, Tushar and Dhar, Rudra and Mamidi, Radhika and Gupta, Manish and Kalidindi, Vasudeva Varma }, TITLE = {Multilingual Bias Detection and Mitigation for Low Resource Languages}, BOOKTITLE = {Wiki Workshop}, YEAR = {2023}}
Subjective bias in Wikipedia textual data is a significant problem and affects millions of readers worldwide. Though some monolingual work has been done in classifying and debiasing biased text in resource-rich languages, the low-resource languages with large numbers of speakers remain unattended. We present an approach for the dual problems of multilingual bias detection and its mitigation with a thorough analysis. In this work, we establish competitive baselines on our preliminary approach, which includes classification-based modelling for bias detection on a multilingual dataset curated from existing monolingual sources. For the problem of bias mitigation, we follow the style transfer paradigm and model using transformer-based seq2seq architectures. We also discuss several approaches for further improvement in both problems as a part of our ongoing work.
Blind Leading the Blind: A Social-Media Analysis of the Tech Industry
Tanishq Chaudhary,Pulak Malhotra,Radhika Mamidi,Ponnurangam Kumaraguru
International Conference on Natural Language Processing, ICON, 2023
@inproceedings{bib_Blin_2023, AUTHOR = {Chaudhary, Tanishq and Malhotra, Pulak and Mamidi, Radhika and Kumaraguru, Ponnurangam }, TITLE = {Blind Leading the Blind: A Social-Media Analysis of the Tech Industry}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2023}}
Automatically Generating Hindi Wikipedia Pages Using Wikidata as a Knowledge Graph: A Domain-Specific Template Sentences Approach
ADITYA AGARWAL,Radhika Mamidi
Recent Advances in Natural Language Processing, RANLP, 2023
@inproceedings{bib_Auto_2023, AUTHOR = {AGARWAL, ADITYA and Mamidi, Radhika }, TITLE = {Automatically Generating Hindi Wikipedia Pages Using Wikidata as a Knowledge Graph: A Domain-Specific Template Sentences Approach}, BOOKTITLE = {Recent Advances in Natural Language Processing}, YEAR = {2023}}
This paper presents a method for automatically generating Wikipedia articles in the Hindi language, using Wikidata as a knowledge base. Our method extracts structured information from Wikidata, such as the names of entities, their properties, and their relationships, and then uses this information to generate natural language text that conforms to a set of templates designed for the domain of interest. We evaluate our method by generating articles about scientists, and we compare the resulting articles to machine-translated articles. Our results show that more than 70% of the generated articles using our method are better in terms of coherence, structure, and readability. Our approach has the potential to significantly reduce the time and effort required to create Wikipedia articles in Hindi and could be extended to other languages and domains as well.
GAE-ISUMM: Unsupervised Graph-based Summarization for Indian Languages
Vakada Lakshmi Sireesha,Chaluvadi Anudeep,Mounika Marreddy,Subba Reddy Oota,Radhika Mamidi
International Joint Conference on Neural Networks, IJCNN, 2023
@inproceedings{bib_GAE-_2023, AUTHOR = {Sireesha, Vakada Lakshmi and Anudeep, Chaluvadi and Marreddy, Mounika and Oota, Subba Reddy and Mamidi, Radhika }, TITLE = {GAE-ISUMM: Unsupervised Graph-based Summarization for Indian Languages}, BOOKTITLE = {International Joint Conference on Neural Networks}, YEAR = {2023}}
Document summarization aims to create a precise and coherent summary of a text document. Many deep learning summarization models are developed mainly for English, often requiring a large training corpus and efficient pre-trained language models and tools. However, English summarization models for low-resource Indian languages are often limited by rich morphological variation, syntax, and semantic differences. In this paper, we propose GAE-ISumm, an unsupervised Indic summarization model that extracts summaries from text documents. In particular, our proposed model, GAE-ISumm uses Graph Autoencoder (GAE) to learn text representations and a document summary jointly. We also provide a manually-annotated Telugu summarization dataset TELSUM, to experiment with our model GAE-ISumm. Further, we experiment with the most publicly available Indian language summarization datasets to investigate the effectiveness of GAE-ISumm on other Indian languages. Our experiments of GAE-ISumm in seven languages make the following observations: (i) it is competitive or better than state-of-the-art results on all datasets, (ii) it reports benchmark results on TELSUM, and (iii) the inclusion of positional and cluster information in the proposed model improved the performance of summaries.
PanwarJayant at SemEval-2023 Task 10: Exploring the Effectiveness of Conventional Machine Learning Techniques for Online Sexism Detection
Jayant Panwar,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2023
@inproceedings{bib_Panw_2023, AUTHOR = {Panwar, Jayant and Mamidi, Radhika }, TITLE = {PanwarJayant at SemEval-2023 Task 10: Exploring the Effectiveness of Conventional Machine Learning Techniques for Online Sexism Detection}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2023}}
The rapid growth of online communication using social media platforms has led to an increase in the presence of hate speech, especially in terms of sexist language online. The proliferation of such hate speech has a significant impact on the mental health and well-being of the users, hence the need for automated systems to detect and filter such texts. In this study, we explore the effectiveness of conventional machine learning techniques for detecting sexist text. We explore five conventional classifiers, namely, Logistic Regression, Decision Tree, XGBoost, Support Vector Machines, and Random Forest. The results show that different classifiers perform differently on each task due to their different inherent architectures which may be suited to a certain problem more. These models are trained on the shared task dataset, which includes both sexist and non-sexist texts. All in all, this study explores the potential of conventional machine learning techniques in detecting online sexist content. The results of this study highlight the strengths and weaknesses of all classifiers with respect to all subtasks. The results of this study will be useful for researchers and practitioners interested in developing systems for detecting or filtering online hate speech.
Matt Bai at SemEval-2023 Task 5: Clickbait spoiler classification via BERT
Nukit Tailor,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2023
@inproceedings{bib_Matt_2023, AUTHOR = {Tailor, Nukit and Mamidi, Radhika}, TITLE = {Matt Bai at SemEval-2023 Task 5: Clickbait spoiler classification via BERT}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2023}}
The Clickbait Spoiling shared task aims at tackling two aspects of spoiling: classifying the spoiler type based on its length and generating the spoiler. This paper focuses on the task of classifying the spoiler type. Better classification of the spoiler type would eventually help in generating a better spoiler for the post. We use BERT-base (cased) to classify the clickbait posts. The model achieves a balanced accuracy of 0.63 when given only the post content as input, rather than the concatenation of the post title and post content, in order to find out what difference the post title makes.
Billy-Batson at SemEval-2023 Task 5: An Information Condensation based System for Clickbait Spoiling
Anubhav Sharma,Sagar Sandeep Joshi,Tushar Abhishek,Radhika Mamidi,Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2023
@inproceedings{bib_Bill_2023, AUTHOR = {Sharma, Anubhav and Joshi, Sagar Sandeep and Abhishek, Tushar and Mamidi, Radhika and Kalidindi, Vasudeva Varma}, TITLE = {Billy-Batson at SemEval-2023 Task 5: An Information Condensation based System for Clickbait Spoiling}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2023}}
The Clickbait Challenge targets spoiling the clickbaits using short pieces of information known as spoilers to satisfy the curiosity induced by a clickbait post. The large context of the article associated with the clickbait and differences in the spoiler forms make the task challenging. Hence, to tackle the large context, we propose an Information Condensation-based approach, which prunes down the unnecessary context. Given an article, our filtering module optimised with a contrastive learning objective first selects the paragraphs that are the most relevant to the corresponding clickbait. The resulting condensed article is then fed to the two downstream tasks of spoiler type classification and spoiler generation. We demonstrate and analyze the gains from this approach on both the tasks. Overall, we win the task of spoiler type classification and achieve competitive results on spoiler generation.
Witcherses at SemEval-2023 Task 12: Ensemble Learning for African Sentiment Analysis
Monil Gokani,K V Aditya Srivatsa,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2023
@inproceedings{bib_Witc_2023, AUTHOR = {Gokani, Monil and Srivatsa, K V Aditya and Mamidi, Radhika}, TITLE = {Witcherses at SemEval-2023 Task 12: Ensemble Learning for African Sentiment Analysis}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2023}}
This paper describes our system submission for SemEval-2023 Task 12 AfriSenti-SemEval: Sentiment Analysis for African Languages. We propose an XGBoost-based ensemble model trained on emoticon frequency-based features and the predictions of several statistical models such as SVMs, Logistic Regression, Random Forests, and BERT-based pre-trained language models such as AfriBERTa and AfroXLMR. We also report results from additional experiments not in the system. Our system achieves a mixed bag of results, achieving a best rank of 7th in three of the languages: Igbo, Twi, and Yoruba.
GSAC: A Gujarati Sentiment Analysis Corpus from Twitter
Monil Gokani,Radhika Mamidi
Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA, 2023
@inproceedings{bib_GSAC_2023, AUTHOR = {Gokani, Monil and Mamidi, Radhika}, TITLE = {GSAC: A Gujarati Sentiment Analysis Corpus from Twitter}, BOOKTITLE = {Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis}, YEAR = {2023}}
Sentiment Analysis is an important task for analysing online content across languages for tasks such as content moderation and opinion mining. Though a significant amount of resources are available for Sentiment Analysis in several Indian languages, there do not exist any large-scale, open-access corpora for Gujarati. Our paper presents and describes the Gujarati Sentiment Analysis Corpus (GSAC), which has been sourced from Twitter and manually annotated by native speakers of the language. We describe in detail our collection and annotation processes and conduct extensive experiments on our corpus to provide reliable baselines for future work using our dataset.
Warning: It’s a scam!! Towards understanding the Employment Scams using Knowledge Graphs
Nidhi Goyal,Niharika Sachdeva,Radhika Mamidi,Ponnurangam Kumaraguru
Joint International Conference on Data Science & Management of Data, CODS-COMAD, 2023
@inproceedings{bib_Warn_2023, AUTHOR = {Goyal, Nidhi and Sachdeva, Niharika and Mamidi, Radhika and Kumaraguru, Ponnurangam}, TITLE = {Warning: It’s a scam!! Towards understanding the Employment Scams using Knowledge Graphs}, BOOKTITLE = {Joint International Conference on Data Science \& Management of Data}, YEAR = {2023}}
Employment scams, such as scapegoat positions, clickbait, and non-existing jobs, are among the top five scams registered over online platforms. Generally, scam complaints contain heterogeneous information (money, location, employment type, organization, email, and phone number), which can provide critical insights for appropriate interventions to avoid scams. Despite substantial efforts to analyze employment scams, integrating relevant scam-related information in structured form remains unexplored. In this work, we extract this information and construct a large-scale Employment Scam Knowledge Graph consisting of 0.1M entities and 0.2M relationships. Our findings include discovering different modes of employment scams, entities, and relationships among entities to alert job seekers. We plan to extend this work by utilizing a knowledge graph to identify and avoid potential scams in the future.
A code-mixed task-oriented dialog dataset for medical domain
Dowlagar Suman,Radhika Mamidi
Computer Speech & Language, CS&L, 2023
@inproceedings{bib_A_co_2023, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika}, TITLE = {A code-mixed task-oriented dialog dataset for medical domain}, BOOKTITLE = {Computer Speech \& Language}, YEAR = {2023}}
In the healthcare domain, medical and patient interactions form a crucial part of the diagnosis. Initially, the AI models developed for healthcare centered only on monolingual data. However, such models do not cater to the multilingual regions, where most conversations are Code-Mixed. We present the Code-Mixed Medical Task-Oriented Dialog Dataset to facilitate the research and development of Code-Mixed medical dialog systems. We analyzed the dataset using medical, conversational, and linguistic theories. The dataset contains 3005 Telugu–English Code-Mixed dialogs between patients and doctors with 29 k utterances covering ten specializations with an average code-mixing index (CMI) of 33.3%. We manually annotated the conversational dataset with intents and slot labels. We also present baselines to establish benchmarks on the dataset using existing state-of-the-art Natural Language Understanding …
English To Indian Sign Language: Rule-Based Translation System Along With Multi-Word Expressions and Synonym Substitution
Abhigyan Ghosh,Radhika Mamidi
International Conference on Natural Language Processing, ICON, 2022
@inproceedings{bib_Engl_2022, AUTHOR = {Ghosh, Abhigyan and Mamidi, Radhika}, TITLE = {English To Indian Sign Language: Rule-Based Translation System Along With Multi-Word Expressions and Synonym Substitution}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2022}}
Hearing-challenged communities all over the world face difficulties communicating with others. Machine translation has been one of the prominent technologies to facilitate communication with the deaf and hard of hearing community worldwide. We have explored and formulated the fundamental rules of Indian Sign Language (ISL) and implemented them as a translation mechanism from English text to Indian Sign Language glosses. According to the formulated rules and sub-rules, the source text structure is identified and transferred to the target ISL gloss. This target language is such that it can be easily converted to videos using the Indian Sign Language dictionary. This research work also describes the intermediate phases of the transfer process and innovations in the process, such as Multi-Word Expression detection and synonym substitution, to handle the limited vocabulary size of Indian Sign Language while producing semantically accurate translations.
Hate Speech Detection on Code-Mixed Dataset Using a Fusion of Custom and Pre-trained Models with Profanity Vector Augmentation
Dowlagar Suman,Radhika Mamidi
SN Computer Science, SNCS, 2022
@inproceedings{bib_Hate_2022, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika}, TITLE = {Hate Speech Detection on Code-Mixed Dataset Using a Fusion of Custom and Pre-trained Models with Profanity Vector Augmentation}, BOOKTITLE = {SN Computer Science}, YEAR = {2022}}
With the increase in user-generated content on social media networks, hate speech and offensive language content are also increasing. From the perspective of computer science, automatic detection of such hate speech and offensive language content is an interesting problem to solve. The natural language community has taken a step to identify such content via automated hate speech and offensive content detection. The hate speech content is generated mostly on social media, and automatic hate speech and offensive language detection face many challenges due to non-standard spelling and grammar variations. Specifically, in a multilingual community, the hate content would be in code-mixed form, making the task further challenging. In this article, we propose a model for code-mixed hate speech detection. This model embeds the knowledge from both user-trained and multilingual pre-trained models. The proposed method also calculates the profanity word list and augments it. Experimental results on code-mixed hate speech and offensive language detection benchmarks show that our method outperforms the existing baselines.
On the Importance of Karaka Framework in Multi-modal Grounding
GORTHI SAI KIRAN,Radhika Mamidi
Technical Report, arXiv, 2022
@inproceedings{bib_On_t_2022, AUTHOR = {KIRAN, GORTHI SAI and Mamidi, Radhika}, TITLE = {On the Importance of Karaka Framework in Multi-modal Grounding}, BOOKTITLE = {Technical Report}, YEAR = {2022}}
The Computational Paninian Grammar (CPG) model helps in decoding a natural language expression as a series of modifier-modified relations and therefore facilitates identifying dependency relations closer to language/context semantics compared to the usual Stanford dependency relations. However, the importance of this CPG dependency scheme has not been studied in the context of multi-modal vision and language applications. At IIIT-H, we plan to perform a novel study to explore the potential advantages and disadvantages of the CPG framework in a vision-language navigation task setting, a popular and challenging multimodal grounding task.
DepressionOne@ LT-EDI-ACL2022: Using Machine Learning with SMOTE and Random UnderSampling to Detect Signs of Depression on Social Media Text.
Dowlagar Suman,Radhika Mamidi
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion, EDI-W, 2022
@inproceedings{bib_Depr_2022, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika}, TITLE = {DepressionOne@ LT-EDI-ACL2022: Using Machine Learning with SMOTE and Random UnderSampling to Detect Signs of Depression on Social Media Text.}, BOOKTITLE = {Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion}, YEAR = {2022}}
Depression is a common and serious medical illness that negatively affects how you feel, the way you think, and how you act. Detecting depression is essential as it must be treated early to avoid painful consequences. Nowadays, people are broadcasting how they feel via posts and comments. Using social media, we can extract many comments related to depression and use NLP techniques to train and detect depression. This work presents the submission of the DepressionOne team at LT-EDI-2022 for the shared task, detecting signs of depression from social media text. The depression data is small and unbalanced. Thus, we have used oversampling and undersampling methods such as SMOTE and RandomUnderSampler to represent the data. Later, we used machine learning methods to train and detect the signs of depression.
Sammaan@ lt-edi-acl2022: Ensembled transformers against Homophobia and Transphobia
Ishan Sanjeev Upadhyay,K V Aditya Srivatsa,Radhika Mamidi
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion, EDI-W, 2022
@inproceedings{bib_Samm_2022, AUTHOR = {Upadhyay, Ishan Sanjeev and Srivatsa, K V Aditya and Mamidi, Radhika}, TITLE = {Sammaan@ lt-edi-acl2022: Ensembled transformers against Homophobia and Transphobia}, BOOKTITLE = {Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion}, YEAR = {2022}}
Hateful and offensive content on social media platforms can have negative effects on users, can make online communities more hostile towards certain people, and can hamper equality, diversity and inclusion. In this paper, we describe our approach to classifying homophobia and transphobia in social media comments. We used an ensemble of transformer-based models to build our classifier. Our model ranked 2nd for English, 8th for Tamil and 10th for Tamil-English.
TeluguNER: Leveraging Multi-Domain Named Entity Recognition with Deep Transformers
Suma Reddy Duggenpudi,Oota Subba Reddy,Mounika Marreddy,Radhika Mamidi
Association for Computational Linguistics: Student Research Workshop, ACL - W, 2022
@inproceedings{bib_Telu_2022, AUTHOR = {Duggenpudi, Suma Reddy and Reddy, Oota Subba and Marreddy, Mounika and Mamidi, Radhika}, TITLE = {TeluguNER: Leveraging Multi-Domain Named Entity Recognition with Deep Transformers}, BOOKTITLE = {Association for Computational Linguistics: Student Research Workshop}, YEAR = {2022}}
Named Entity Recognition (NER) is a successful and well-researched problem in English due to the availability of resources. The transformer models, specifically the masked-language models (MLM), have shown remarkable performance in NER in recent times. With growing data in different online platforms, there is a need for NER in other languages too. NER remains underexplored in Indian languages due to the lack of resources and tools. Our contributions in this paper include (i) two annotated NER datasets for the Telugu language in multiple domains: Newswire Dataset (ND) and Medical Dataset (MD), and we combined ND and MD to form a Combined Dataset (CD); (ii) comparison of the finetuned Telugu pretrained transformer models (BERT-Te, RoBERTa-Te, and ELECTRA-Te) with other baseline models (CRF, LSTM-CRF, and BiLSTM-CRF); (iii) further investigation of the performance of Telugu pretrained transformer models against the multilingual models mBERT (Devlin et al., 2018), XLM-R (Conneau et al., 2020), and IndicBERT (Kakwani et al., 2020). We find that pretrained Telugu language models (BERT-Te and RoBERTa) outperform the existing pretrained multilingual and baseline models in NER. On a large dataset (CD) of 38,363 sentences, the BERT-Te achieves a high F1-score of 0.80 (entity-level) and 0.75 (token-level). Further, these pretrained Telugu models have shown state-of-the-art performance on various
Detection of Propaganda Techniques in Visuo-Lingual Metaphor in Memes
Gundapu Sunil,Radhika Mamidi
Technical Report, arXiv, 2022
@inproceedings{bib_Dete_2022, AUTHOR = {Sunil, Gundapu and Mamidi, Radhika}, TITLE = {Detection of Propaganda Techniques in Visuo-Lingual Metaphor in Memes}, BOOKTITLE = {Technical Report}, YEAR = {2022}}
The exponential rise of social media networks has allowed the production, distribution, and consumption of data at a phenomenal rate. Moreover, the social media revolution has brought a unique phenomenon to social media platforms called Internet memes. Internet memes are one of the most popular contents used on social media, and they can be in the form of images with a witty, catchy, or satirical text description. In this paper, we are dealing with propaganda that is often seen in Internet memes in recent times. Propaganda is communication, which frequently includes psychological and rhetorical techniques to manipulate or influence an audience to act or respond as the propagandist wants. To detect propaganda in Internet memes, we propose a multimodal deep learning fusion system that fuses the text and image feature representations and outperforms individual models based solely on either text or image modalities.
CMNEROne at SemEval-2022 Task 11: Code-Mixed Named Entity Recognition by leveraging multilingual data
Dowlagar Suman,Radhika Mamidi
Technical Report, arXiv, 2022
@inproceedings{bib_CMNE_2022, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika}, TITLE = {CMNEROne at SemEval-2022 Task 11: Code-Mixed Named Entity Recognition by leveraging multilingual data}, BOOKTITLE = {Technical Report}, YEAR = {2022}}
LastResort at SemEval-2022 Task 4: Towards Patronizing and Condescending Language Detection using Pre-trained Transformer Based Models Ensembles
Samyak Agrawal,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2022
@inproceedings{bib_Last_2022, AUTHOR = {Agrawal, Samyak and Mamidi, Radhika}, TITLE = {LastResort at SemEval-2022 Task 4: Towards Patronizing and Condescending Language Detection using Pre-trained Transformer Based Models Ensembles}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2022}}
This paper presents our solutions for Task 4 at SemEval-2022: Patronizing and Condescending Language Detection. This shared task contains two sub-tasks. The first sub-task is a binary classification task whose goal is to predict whether a given paragraph contains any form of patronising or condescending language (PCL). For the second sub-task, given a paragraph, we have to find which PCL categories express the condescension. Here we have a total of 7 overlapping sub-categories for PCL. Our proposed solution uses BERT-based ensembled models with hard voting and techniques applied to take care of class imbalances. Our paper describes the system architecture of the submitted solution and other experiments that we conducted. Our best performing models achieve an F1 score of 59.4 and 15.7 on sub-tasks 1 and 2 respectively.
Lastresort at semeval-2022 task 5: Towards misogyny identification using visual linguistic model ensembles and task-specific pretraining
Samyak Agrawal,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2022
@inproceedings{bib_Last_2022, AUTHOR = {Agrawal, Samyak and Mamidi, Radhika}, TITLE = {Lastresort at semeval-2022 task 5: Towards misogyny identification using visual linguistic model ensembles and task-specific pretraining}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2022}}
In current times, memes have become one of the most popular mediums to share jokes and information with the masses over the internet. Memes can also be used as tools to spread hatred and target women through degrading content disguised as humour. The task, Multimedia Automatic Misogyny Identification (MAMI), is to detect misogyny in these memes. This task is further divided into two sub-tasks: (A) misogynous meme identification, where a meme should be categorized either as misogynous or not misogynous, and (B) categorizing these misogynous memes into potentially overlapping subcategories. In this paper, we propose models leveraging task-specific pretraining with transfer learning on Visual Linguistic models. With our best performing models, we were able to achieve ranks 5th and 10th on sub-tasks A and B respectively.
Towards Toxic Positivity Detection
Ishan Sanjeev Upadhyay,K V Aditya Srivatsa,Radhika Mamidi
International Workshop on Natural Language Processing for Social Media, SocialNLP-W, 2022
@inproceedings{bib_Towa_2022, AUTHOR = {Upadhyay, Ishan Sanjeev and Srivatsa, K V Aditya and Mamidi, Radhika}, TITLE = {Towards Toxic Positivity Detection}, BOOKTITLE = {International Workshop on Natural Language Processing for Social Media}, YEAR = {2022}}
Over the past few years, there has been a growing concern around toxic positivity on social media which is a phenomenon where positivity is used to minimize one’s emotional experience. In this paper, we create a dataset for toxic positivity classification from Twitter and an inspirational quote website. We then perform benchmarking experiments using various text classification models and show the suitability of these models for the task. We achieved a macro F1 score of 0.71 and a weighted F1 score of 0.85 by using an ensemble model. To the best of our knowledge, our dataset is the first such dataset created.
Multi-Task Text Classification using Graph Convolutional Networks for Large-Scale Low Resource Language
Mounika Marreddy,Oota Subba Reddy,Vakada Lakshmi Sireesha,Chinni Venkata Charan,Radhika Mamidi
International Joint Conference on Neural Networks, IJCNN, 2022
@inproceedings{bib_Mult_2022, AUTHOR = {Marreddy, Mounika and Reddy, Oota Subba and Sireesha, Vakada Lakshmi and Charan, Chinni Venkata and Mamidi, Radhika}, TITLE = {Multi-Task Text Classification using Graph Convolutional Networks for Large-Scale Low Resource Language}, BOOKTITLE = {International Joint Conference on Neural Networks}, YEAR = {2022}}
Graph Convolutional Networks (GCN) have achieved state-of-the-art results on single text classification tasks like sentiment analysis, emotion detection, etc. However, the performance is achieved by testing and reporting on resource-rich languages like English. Applying GCN for multi-task text classification is an unexplored area. Moreover, training a GCN or adopting an English GCN for Indian languages is often limited by data availability, rich morphological variation, syntax, and semantic differences. In this paper, we study the use of GCN for the Telugu language in single and multi-task settings for four natural language processing (NLP) tasks, viz. sentiment analysis (SA), emotion identification (EI), hate-speech (HS), and sarcasm detection (SAR). In order to evaluate the performance of GCN with one of the Indian languages, Telugu, we analyze the GCN-based models with extensive experiments on four downstream tasks. In addition, we created an annotated Telugu dataset, TEL-NLP, for the four NLP tasks. Further, we propose a supervised graph reconstruction method for Telugu, Multi-Task Text GCN (MT-Text GCN), that simultaneously (i) learns the low-dimensional word and sentence graph embeddings from word-sentence graph reconstruction using a graph autoencoder (GAE) and (ii) performs multi-task text classification using these latent sentence graph embeddings. We argue that our proposed MT-Text GCN achieves significant improvements on TEL-NLP over existing Telugu pretrained word embeddings [1], and multilingual pretrained Transformer models: mBERT [2], and XLM-R [3]. On TEL-NLP, we achieve a high F1-score for four NLP tasks: SA (0.84), EI (0.55), HS (0.83) and SAR (0.66). Finally, we show our model’s quantitative and qualitative analysis on the four NLP tasks in Telugu. We open-source our TEL-NLP dataset, pretrained models, and code. Index Terms: Graph Convolutional Networks, Text
cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation
Kshitij Gupta,Devansh Gautam,Radhika Mamidi
International Conference on Pattern Recognition, ICPR, 2022
@inproceedings{bib_cViL_2022, AUTHOR = {Gupta, Kshitij and Gautam, Devansh and Mamidi, Radhika}, TITLE = {cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation}, BOOKTITLE = {International Conference on Pattern Recognition}, YEAR = {2022}}
Vision-and-language tasks are gaining popularity in the research community, but the focus is still mainly on English. We propose a pipeline that utilizes English-only vision-language models to train a monolingual model for a target language. We propose to extend OSCAR+, a model which leverages object tags as anchor points for learning image-text alignments, to train on visual question answering datasets in different languages. We propose a novel approach to knowledge distillation to train the model in other languages using parallel sentences. Compared to other models that use the target language in the pretraining corpora, we can leverage an existing English model to transfer the knowledge to the target language using significantly fewer resources. We also release a large-scale visual question answering dataset in Japanese and Hindi. Though we restrict our work to visual question answering, our model can be extended to any sequence-level classification task, and it can be extended to other languages as well. This paper focuses on two languages for the visual question answering task: Japanese and Hindi. Our pipeline outperforms the current state-of-the-art models by a relative increase of 4.4% and 13.4% respectively in accuracy.
Using Selective Masking as a Bridge between Pre-training and Fine-tuning
Tanish Lad,Himanshu Maheshwari,Shreyas Shankar Kottukkal,Radhika Mamidi
Neural Information Processing Systems Workshops, NeurIPS-W, 2022
@inproceedings{bib_Usin_2022, AUTHOR = {Lad, Tanish and Maheshwari, Himanshu and Kottukkal, Shreyas Shankar and Mamidi, Radhika}, TITLE = {Using Selective Masking as a Bridge between Pre-training and Fine-tuning}, BOOKTITLE = {Neural Information Processing Systems Workshops}, YEAR = {2022}}
Pre-training a language model and then fine-tuning it for downstream tasks has demonstrated state-of-the-art results for various NLP tasks. Pre-training is usually independent of the downstream task, and previous works have shown that this pre-training alone might not be sufficient to capture the task-specific nuances. We propose a way to tailor a pre-trained BERT model for the downstream task via task-specific masking before the standard supervised fine-tuning. For this, a word list specific to the task is first collected. For example, if the task is sentiment classification, we collect a small sample of words representing both positive and negative sentiments. Next, a word’s importance for the task, called the word’s task score, is measured using the word list. Each word is then assigned a probability of masking based on its task score. We experiment with different masking functions that assign the probability of masking based on the word’s task score. The BERT model is further trained on the MLM objective, where masking is done using the above strategy. Following this, standard supervised fine-tuning is done for different downstream tasks. Results on these tasks show that the selective masking strategy outperforms random masking, indicating its effectiveness.
Am I a Resource-Poor Language? Data Sets, Embeddings, Models and Analysis for four different NLP Tasks in Telugu Language
Mounika Marreddy,Oota Subba Reddy,Vakada Lakshmi Sireesha,Chinni Venkata Charan,Radhika Mamidi
ACM Transactions on Asian and Low Resource Language Information Processing, TALLIP, 2022
@inproceedings{bib_Am_I_2022, AUTHOR = {Marreddy, Mounika and Reddy, Oota Subba and Sireesha, Vakada Lakshmi and Charan, Chinni Venkata and Mamidi, Radhika}, TITLE = {Am I a Resource-Poor Language? Data Sets, Embeddings, Models and Analysis for four different NLP Tasks in Telugu Language}, BOOKTITLE = {ACM Transactions on Asian and Low Resource Language Information Processing}, YEAR = {2022}}
Due to the lack of a large annotated corpus, many resource-poor Indian languages struggle to reap the benefits of recent deep feature representations in Natural Language Processing (NLP). Moreover, adopting existing language models trained on large English corpora for Indian languages is often limited by data availability, rich morphological variation, syntax, and semantic differences. In this paper, we explore the traditional to recent efficient representations to overcome the challenges of a low resource language, Telugu. In particular, our main objective is to mitigate the low-resource problem for Telugu. Overall, we present several contributions to a resource-poor language viz. Telugu. (i) a large annotated data (35,142 sentences in each task) for multiple NLP tasks such as sentiment analysis, emotion identification, hate-speech detection, and sarcasm detection, (ii) we create different lexicons for sentiment, emotion, and hate-speech for improving the efficiency of the models, (iii) pretrained word and sentence embeddings, and (iv) different pretrained language models for Telugu such as ELMo-Te, BERT-Te, RoBERTa-Te, ALBERT-Te, and DistilBERT-Te on a large Telugu corpus consisting of 8,015,588 sentences (1,637,408 sentences from Telugu Wikipedia and 6,378,180 sentences crawled from different Telugu websites). Further, we show that these representations significantly improve the performance of four NLP tasks and present the benchmark results for Telugu. We argue that our pretrained embeddings are competitive or better than the existing multilingual pretrained models: mBERT, XLM-R, and IndicBERT. Lastly, the fine-tuning of pretrained models shows higher performance than linear probing results on four NLP tasks with the following F1-scores: Sentiment (68.72), Emotion (58.04), Hate-Speech (64.27), and Sarcasm (77.93).
We also experiment on publicly available Telugu datasets (Named Entity Recognition, Article Genre Classification, and Sentiment Analysis) and find that our Telugu pretrained language models (BERT-Te and RoBERTa-Te) outperform the state-of-the-art system except for the sentiment task. We open-source our corpus, four different datasets, lexicons, embeddings, and code https://github.com/Cha14ran/DREAM-T. The pretrained Transformer models for Telugu are available at https://huggingface.co/ltrctelug
GAE-ISumm: Unsupervised Graph-Based Summarization of Indian Languages
Lakshmi Sireesha Vakada,Anudeep Ch,Mounika Marreddy,Subba Reddy Oota,Radhika Mamidi
Technical Report, arXiv, 2022
@inproceedings{bib_GAE-_2022, AUTHOR = {Vakada, Lakshmi Sireesha and Ch, Anudeep and Marreddy, Mounika and Oota, Subba Reddy and Mamidi, Radhika}, TITLE = {GAE-ISumm: Unsupervised Graph-Based Summarization of Indian Languages}, BOOKTITLE = {Technical Report}, YEAR = {2022}}
Document summarization aims to create a precise and coherent summary of a text document. Many deep learning summarization models are developed mainly for English, often requiring a large training corpus and efficient pre-trained language models and tools. However, English summarization models for low-resource Indian languages are often limited by rich morphological variation, syntax, and semantic differences. In this paper, we propose GAE-ISUMM, an unsupervised Indic summarization model that extracts summaries from text documents. In particular, our proposed model, GAE-ISUMM, uses a Graph Autoencoder (GAE) to learn text representations and a document summary jointly. We also provide a manually-annotated Telugu summarization dataset, TELSUM, to experiment with our model GAE-ISUMM. Further, we experiment with the most publicly available Indian language summarization datasets to investigate the effectiveness of GAE-ISUMM on other Indian languages. Our experiments of GAE-ISUMM in seven languages make the following observations: (i) it is competitive or better than state-of-the-art results on all datasets, (ii) it reports benchmark results on TELSUM, and (iii) the inclusion of positional and cluster information in the proposed model improved the performance of summaries. We open-source our dataset and code.
Am I a Resource-Poor Language? Data Sets, Embeddings, Models and Analysis for four different NLP tasks in Telugu Language
Mounika Marreddy,Oota Subba Reddy,Vakada Lakshmi Sireesha,Chinni Venkata Charan,Radhika Mamidi
ACM Transactions on Asian and Low Resource Language Information Processing, TALLIP, 2022
@inproceedings{bib_Am_I_2022, AUTHOR = {Marreddy, Mounika and Reddy, Oota Subba and Sireesha, Vakada Lakshmi and Charan, Chinni Venkata and Mamidi, Radhika}, TITLE = {Am I a Resource-Poor Language? Data Sets, Embeddings, Models and Analysis for four different NLP tasks in Telugu Language}, BOOKTITLE = {ACM Transactions on Asian and Low Resource Language Information Processing}, YEAR = {2022}}
Due to the lack of a large annotated corpus, many resource-poor Indian languages struggle to reap the benefits of recent deep feature representations in Natural Language Processing (NLP). Moreover, adopting existing language models trained on large English corpora for Indian languages is often limited by data availability, rich morphological variation, syntax, and semantic differences. In this paper, we explore the traditional to recent efficient representations to overcome the challenges of a low-resource language, Telugu. In particular, our main objective is to mitigate the low-resource problem for Telugu. Overall, we present several contributions to a resource-poor language viz. Telugu: (i) a large annotated data (35,142 sentences in each task) for multiple NLP tasks such as sentiment analysis, emotion identification, hate-speech detection, and sarcasm detection, (ii) different lexicons for sentiment, emotion, and hate-speech for improving the efficiency of the models, (iii) pretrained word and sentence embeddings, and (iv) different pretrained language models for Telugu such as ELMo-Te, BERT-Te, RoBERTa-Te, ALBERT-Te, and DistilBERT-Te on a large Telugu corpus consisting of 80,15,588 sentences (16,37,408 sentences from Telugu Wikipedia and 63,78,180 sentences crawled from different Telugu websites). Further, we show that these representations significantly improve the performance of four NLP tasks and present the benchmark results for Telugu. We argue that our pretrained embeddings are competitive or better than the existing multilingual pretrained models: mBERT, XLM-R, and IndicBERT. Lastly, the fine-tuning of pretrained models shows higher performance than linear probing results on four NLP tasks with the following F1-scores: Sentiment (68.72), Emotion (58.04), Hate-Speech (64.27) and Sarcasm (77.93).
We also experiment on publicly available Telugu datasets (Named Entity Recognition, Article Genre Classification, and Sentiment Analysis) and find that our Telugu pretrained language models (BERT-Te and RoBERTa-Te) outperform the state-of-the-art system except for the sentiment task. We open-source our corpus, four different datasets, lexicons, embeddings, and code https://github.com/Cha14ran/DREAM-T. The pretrained Transformer models for Telugu are available at https://huggingface.co/ltrctelugu.
Towards Detecting Political Bias in Hindi News Articles
Samyak Agrawal,Kshitij Gupta,Devansh Gautam,Radhika Mamidi
Association for Computational Linguistics: Student Research Workshop, ACL - W, 2022
@inproceedings{bib_Towa_2022, AUTHOR = {Agrawal, Samyak and Gupta, Kshitij and Gautam, Devansh and Mamidi, Radhika}, TITLE = {Towards Detecting Political Bias in Hindi News Articles}, BOOKTITLE = {Association for Computational Linguistics: Student Research Workshop}, YEAR = {2022}}
Political propaganda in recent times has been amplified by media news portals through biased reporting, creating untruthful narratives on serious issues and causing misinformed public opinion, with the aim of siding with and helping a particular political party. This issue poses the challenging NLP task of detecting political bias in news articles. We propose a transformer-based transfer learning method to fine-tune the pre-trained network on our data for this bias detection. As the required dataset for this particular task was not available, we created our own dataset comprising 1388 Hindi news articles and their headlines from various Hindi news media outlets. We marked them on whether they are biased towards, against, or neutral to the BJP, a political party and the current ruling party at the centre in India.
LastResort at SemEval-2022 Task 4: Towards Patronizing and Condescending Language Detection using Pre-trained Transformer Based Model Ensembles
Samyak Agrawal,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2022
@inproceedings{bib_Last_2022, AUTHOR = {Agrawal, Samyak and Mamidi, Radhika}, TITLE = {LastResort at SemEval-2022 Task 4: Towards Patronizing and Condescending Language Detection using Pre-trained Transformer Based Model Ensembles}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2022}}
This paper presents our solutions for Task 4 at SemEval-2022: Patronizing and Condescending Language Detection. This shared task contains two sub-tasks. The first sub-task is a binary classification task whose goal is to predict whether a given paragraph contains any form of patronising or condescending language (PCL). For the second sub-task, given a paragraph, we have to find which PCL categories express the condescension. Here we have a total of 7 overlapping sub-categories for PCL. Our proposed solution uses BERT-based ensembled models with hard voting and techniques applied to take care of class imbalances. Our paper describes the system architecture of the submitted solution and other experiments that we conducted. Our best performing models achieve an F1 score of 59.4 and 15.7 on sub-tasks 1 and 2 respectively.
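As a minimal sketch of the hard-voting idea mentioned in the abstract above: each model in an ensemble emits one label per input, and the majority label wins. The function name `hard_vote` and the toy predictions are hypothetical illustrations, not taken from the paper, and the tie-breaking rule (first label seen wins) is an assumption.

```python
from collections import Counter


def hard_vote(predictions):
    """Combine per-model label lists by majority vote.

    predictions: list of lists; predictions[m][i] is model m's label
    for item i. All inner lists are assumed to have the same length.
    """
    n_items = len(predictions[0])
    voted = []
    for i in range(n_items):
        votes = Counter(model_preds[i] for model_preds in predictions)
        # most_common(1) returns the label with the highest count;
        # on a tie, the label encountered first is returned.
        voted.append(votes.most_common(1)[0][0])
    return voted


# Three hypothetical binary classifiers voting on three paragraphs.
ensemble_labels = hard_vote([[1, 0, 1], [1, 1, 0], [0, 1, 1]])
```

With an odd number of binary classifiers, as in the toy call, ties cannot occur, so the tie-breaking assumption only matters for even-sized ensembles or multi-class labels.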
Detection of Fake Users in Twitter Using Network Representation and NLP
Manojit Chakraborty,Shubham Das,Radhika Mamidi
International Conference on Communication Systems & Networks, COMSNETS, 2022
@inproceedings{bib_Dete_2022, AUTHOR = {Chakraborty, Manojit and Das, Shubham and Mamidi, Radhika}, TITLE = {Detection of Fake Users in Twitter Using Network Representation and NLP}, BOOKTITLE = {International Conference on Communication Systems \& Networks}, YEAR = {2022}}
Social media platforms like Facebook, Twitter, Instagram, etc. have a large user base all around the world that generates huge amounts of data every second. This includes a lot of posts by fake and spam users, typically used by many organizations around the globe to gain a competitive edge over others. In this work, we aim at detecting such user accounts on Twitter. We show how to distinguish between genuine and spam accounts on Twitter using a novel combination of feature engineering, network representation, and natural language processing techniques. Index Terms—social media, fake user, Twitter, spam detection, graph, natural language processing
Towards Conversational Humor and Design
Tanishq Chaudhary,Mayank Goyal,Radhika Mamidi
Humor Research Conference, HRC, 2021
@inproceedings{bib_Towa_2021, AUTHOR = {Chaudhary, Tanishq and Goyal, Mayank and Mamidi, Radhika}, TITLE = {Towards Conversational Humor and Design}, BOOKTITLE = {Humor Research Conference}, YEAR = {2021}}
Well-defined jokes can be divided neatly into a setup and a punchline. While most works on humor today talk about a joke as a whole, the idea of generating punchlines to a setup has applications in conversational humor, where funny remarks usually occur with a non-funny context. Thus, this paper is based around two core concepts: classification and the generation of a punchline from a particular setup based on the Incongruity Theory. We first implement a feature-based machine learning model to classify humor. For humor generation, we use a neural model, and then merge the classical rule-based approaches with the neural approach to create a hybrid model. The idea is to combine insights gained from other tasks with the setup-punchline model and apply them to existing text generation approaches. We then compare our model's output with human-written jokes with the help of human evaluators in a double-blind study.
Jibes & Delights: A Dataset of Targeted Insults and Compliments to Tackle Online Abuse
Ravsimar Singh Sodhi,Kartikey Pant,Radhika Mamidi
Workshop on Online Abuse and Harms, WOAH, 2021
@inproceedings{bib_Jibe_2021, AUTHOR = {Sodhi, Ravsimar Singh and Pant, Kartikey and Mamidi, Radhika}, TITLE = {Jibes \& Delights: A Dataset of Targeted Insults and Compliments to Tackle Online Abuse}, BOOKTITLE = {Workshop on Online Abuse and Harms}, YEAR = {2021}}
Online abuse and offensive language on social media have become widespread problems in today’s digital age. In this paper, we contribute a Reddit-based dataset, consisting of 68,159 insults and 51,102 compliments targeted at individuals instead of targeting a particular community or race. Secondly, we benchmark multiple existing state-of-the-art models for both classification and unsupervised style transfer on the dataset. Finally, we analyse the experimental results and conclude that the transfer task is challenging, requiring the models to understand the high degree of creativity exhibited in the data.
OFFLangOne@DravidianLangTech-EACL2021: Transformers with the Class Balanced Loss for Offensive Language Identification in Dravidian Code-Mixed text.
Suman Dowlagar,Radhika Mamidi
Conference of the European Chapter of the Association for Computational Linguistics (EACL), EACL, 2021
@inproceedings{bib_OFFL_2021, AUTHOR = {Dowlagar, Suman and Mamidi, Radhika}, TITLE = {OFFLangOne@DravidianLangTech-EACL2021: Transformers with the Class Balanced Loss for Offensive Language Identification in Dravidian Code-Mixed text.}, BOOKTITLE = {Conference of the European Chapter of the Association for Computational Linguistics (EACL)}, YEAR = {2021}}
The intensity of online abuse has increased in recent years. Automated tools are being developed to prevent the use of hate speech and offensive content. Most of the technologies use natural language and machine learning tools to identify offensive text. In a multilingual society, where code-mixing is a norm, hate content is delivered in a code-mixed form on social media, which makes offensive content identification even more challenging. In this work, we participated in the EACL task to detect offensive content in the code-mixed social media scenario. The methodology uses a transformer model with transliteration and class balancing loss for offensive content identification. In this task, our model was ranked 2nd in Malayalam-English and 4th in Tamil-English code-mixed languages.
A Pre-trained Transformer and CNN model with Joint Language ID and Part-of-Speech Tagging for Code-Mixed Social-Media Text
Suman Dowlagar,Radhika Mamidi
Recent Advances in Natural Language Processing, RANLP, 2021
@inproceedings{bib_A_Pr_2021, AUTHOR = {Dowlagar, Suman and Mamidi, Radhika}, TITLE = {A Pre-trained Transformer and CNN model with Joint Language ID and Part-of-Speech Tagging for Code-Mixed Social-Media Text}, BOOKTITLE = {Recent Advances in Natural Language Processing}, YEAR = {2021}}
Code-mixing (CM) is a frequently observed phenomenon that uses multiple languages in an utterance or sentence. There are no strict grammatical constraints observed in code-mixing, and it consists of non-standard variations of spelling. The linguistic complexity resulting from the above factors makes the computational analysis of code-mixed language a challenging task. Language identification (LI) and part of speech (POS) tagging are the fundamental steps that help analyze the structure of the code-mixed text. Often, the LI and POS tagging tasks are interdependent in the code-mixing scenario. We project the problem of dealing with multilingualism and grammatical structure while analyzing the code-mixed sentence as a joint learning task. In this paper, we jointly train and optimize language detection and part of speech tagging models in the code-mixed scenario. We used a Transformer with convolutional neural network architecture. We train a joint learning method by combining POS tagging and LI models on code-mixed social media text obtained from the ICON shared task.
Automatic Learning Assistant in Telugu
Bommadi Meghana,Shreya Reddy Terupally,Radhika Mamidi
Workshop on Document-grounded Dialogue and Conversational Question Answering, DialDoc, 2021
@inproceedings{bib_Auto_2021, AUTHOR = {Meghana, Bommadi and Terupally, Shreya Reddy and Mamidi, Radhika}, TITLE = {Automatic Learning Assistant in Telugu}, BOOKTITLE = {Workshop on Document-grounded Dialogue and Conversational Question Answering}, YEAR = {2021}}
This paper presents a learning assistant that tests one’s knowledge and gives feedback that helps a person learn at a faster pace. A learning assistant (based on automated question generation) has extensive uses in education, information websites, self-assessment, FAQs, testing ML agents, research, etc. Multiple researchers and companies have worked on virtual assistance, but mostly in English. We built our learning assistant for the Telugu language to help with teaching in the mother tongue, which is the most efficient way of learning. Many experiments have been conducted on question generation in English in multiple ways. We have built the first hybrid machine learning and rule-based solution in Telugu, which proves efficient for short stories or short passages in children’s books. Our work covers the fundamental question forms with question types: adjective, yes/no, adverb, verb, when, where, whose, quotative, and quantitative (how many/how much). We constructed rules for question generation using Part of Speech (POS) tags and Universal Dependency (UD) tags along with linguistic information of the surrounding relevant context of the word. We used keyword matching and multilingual sentence embeddings to evaluate the answer. Our system is primarily built on question generation in Telugu, and is also capable of evaluating the user’s answers to the generated questions.
A Survey of Recent Neural Network Models on Code-Mixed Indian Hate Speech Data
Suman Dowlagar,Radhika Mamidi
Forum for Information Retrieval Evaluation, FIRE, 2021
@inproceedings{bib_A_Su_2021, AUTHOR = {Dowlagar, Suman and Mamidi, Radhika}, TITLE = {A Survey of Recent Neural Network Models on Code-Mixed Indian Hate Speech Data}, BOOKTITLE = {Forum for Information Retrieval Evaluation}, YEAR = {2021}}
In recent years, the exponential increase in social media content has also led to an increase in online hate speech. We need automatic hate speech detection methods due to the volume of data on the web. Various approaches have been proposed to address hate speech and offensive content on social media. This paper surveys how neural-based models have rapidly evolved to address hate speech on multilingual code-mixed data. We discuss the current state of the research in hate speech and offensive language detection on code-mixed Indian datasets.
Towards Sentiment Analysis of Tobacco Products’ Usage in Social Media
Venkata Himakar Yanamandra,Kartikey Pant,Radhika Mamidi
Recent Advances in Natural Language Processing, RANLP, 2021
@inproceedings{bib_Towa_2021, AUTHOR = {Yanamandra, Venkata Himakar and Pant, Kartikey and Mamidi, Radhika}, TITLE = {Towards Sentiment Analysis of Tobacco Products’ Usage in Social Media}, BOOKTITLE = {Recent Advances in Natural Language Processing}, YEAR = {2021}}
Contemporary tobacco-related studies are mostly concerned with a single social media platform while missing out on a broader audience. Moreover, they are heavily reliant on labeled datasets, which are expensive to make. In this work, we explore sentiment and product identification on tobacco-related text from two social media platforms. We release SentiSmoke-Twitter and SentiSmoke-Reddit datasets, along with a comprehensive annotation schema for identifying tobacco products’ sentiment. We then perform benchmarking text classification experiments using state-of-the-art models, including BERT, RoBERTa, and DistilBERT. Our experiments show F1 scores as high as 0.72 for sentiment identification in the Twitter dataset, 0.46 for sentiment identification, and 0.57 for product identification using semi-supervised learning for Reddit.
Corpus Creation and Language Identification in Low-Resource Code-Mixed Telugu-English Text
Kusampudi Siva Subrahamanyam Varma,Chaluvadi Anudeep,Radhika Mamidi
Recent Advances in Natural Language Processing, RANLP, 2021
@inproceedings{bib_Corp_2021, AUTHOR = {Varma, Kusampudi Siva Subrahamanyam and Anudeep, Chaluvadi and Mamidi, Radhika}, TITLE = {Corpus Creation and Language Identification in Low-Resource Code-Mixed Telugu-English Text}, BOOKTITLE = {Recent Advances in Natural Language Processing}, YEAR = {2021}}
Code-Mixing (CM) is a common phenomenon in multilingual societies. CM plays a significant role in technology and medical fields where terminologies in the native language are not available or known. Language Identification (LID) of the CM data will help solve NLP tasks such as Spell Checking, Named Entity Recognition, Part-Of-Speech tagging, and Semantic Parsing. In the current era of machine learning, a common problem for the above-mentioned tasks is the availability of data to train models. In this paper, we introduce two manually annotated Telugu-English CM datasets (a Twitter dataset and a Blog dataset). The Twitter dataset contains more romanization variability and misspelled words than the Blog dataset. We perform extensive benchmarking of both classical and deep learning models for LID and compare them with existing models. We propose two architectures for language classification (Telugu and English) in CM data: (1) Word Level Classification and (2) Sentence Level word-by-word Classification, and compare these approaches, presenting two strong baselines for LID on these datasets.
Sentiment Analysis in Code-Mixed Telugu-English Text with Unsupervised Data Normalization
Kusampudi Siva Subrahamanyam Varma,Sathineni Preetham Reddy,Radhika Mamidi
Recent Advances in Natural Language Processing, RANLP, 2021
@inproceedings{bib_Sent_2021, AUTHOR = {Varma, Kusampudi Siva Subrahamanyam and Reddy, Sathineni Preetham and Mamidi, Radhika}, TITLE = {Sentiment Analysis in Code-Mixed Telugu-English Text with Unsupervised Data Normalization}, BOOKTITLE = {Recent Advances in Natural Language Processing}, YEAR = {2021}}
In a multilingual society, people communicate in more than one language, leading to Code-Mixed data. Sentiment analysis on Code-Mixed Telugu-English Text (CMTET) poses unique challenges. The unstructured nature of the Code-Mixed Data is due to the informal language, informal transliterations, and spelling errors. In this paper, we introduce an annotated dataset for Sentiment Analysis in CMTET. Also, we report an accuracy of 80.22% on this dataset using novel unsupervised data normalization with a Multilayer Perceptron (MLP) model. This proposed data normalization technique can be extended to any NLP task involving CMTET. Further, we report an increase of 2.53% accuracy due to this data normalization approach in our best model.
Developing Conversational Data and Detection of Conversational Humor in Telugu
Vaishnavi Pamulapati,Radhika Mamidi
Workshop on Computational Approaches to Discourse, CODI-W, 2021
@inproceedings{bib_Deve_2021, AUTHOR = {Pamulapati, Vaishnavi and Mamidi, Radhika}, TITLE = {Developing Conversational Data and Detection of Conversational Humor in Telugu}, BOOKTITLE = {Workshop on Computational Approaches to Discourse}, YEAR = {2021}}
In the field of humor research, there has been a recent surge of interest in the sub-domain of Conversational Humor (CH). This study has two main objectives: (a) develop a conversational (humorous and non-humorous) dataset in Telugu, and (b) detect CH in the compiled dataset. In this paper, the challenges faced while collecting the data and the experiments carried out are elucidated. Transfer learning and non-transfer learning techniques are implemented by utilizing pre-trained models such as FastText word embeddings, BERT language models and Text GCN, which learns the word and document embeddings of the given corpus simultaneously. State-of-the-art results are observed with a 99.3% accuracy and a 98.5% F1 score achieved by BERT.
TEASER: Towards Efficient Aspect-based SEntiment analysis and Recognition
Bajaj Vaibhav Ganesh,Kartikey Pant,Ishan Sanjeev Upadhyay,M Srinath Nair,Radhika Mamidi
Recent Advances in Natural Language Processing, RANLP, 2021
@inproceedings{bib_TEAS_2021, AUTHOR = {Ganesh, Bajaj Vaibhav and Pant, Kartikey and Upadhyay, Ishan Sanjeev and Nair, M Srinath and Mamidi, Radhika}, TITLE = {TEASER: Towards Efficient Aspect-based SEntiment analysis and Recognition}, BOOKTITLE = {Recent Advances in Natural Language Processing}, YEAR = {2021}}
Sentiment analysis aims to detect the overall sentiment, i.e., the polarity of a sentence, paragraph, or text span, without considering the entities mentioned and their aspects. Aspect-based sentiment analysis aims to extract the aspects of the given target entities and their respective sentiments. Prior works formulate this as a sequence tagging problem or solve this task using a span-based extract-then-classify framework where first all the opinion targets are extracted from the sentence, and then with the help of span representations, the targets are classified as positive, negative, or neutral. The sequence tagging formulation suffers from issues like sentiment inconsistency and a colossal search space, whereas the span-based extract-then-classify framework suffers from issues such as half-word coverage and overlapping spans. To overcome this, we propose a similar span-based extract-then-classify framework with a novel and improved heuristic. Experiments on the three benchmark datasets (Restaurant14, Laptop14, Restaurant15) show our model consistently outperforms the current state-of-the-art. Moreover, we also present a novel supervised movie reviews dataset (Movie20) and a pseudo-labeled movie reviews dataset (moviesLarge) made explicitly for this task in the English language and report the results on the novel Movie20 dataset as well.
IIITH at SemEval-2021 Task 7: Leveraging transformer-based humourous and offensive text detection architectures using lexical and hurtlex features and task adaptive pretraining
Tathagata Raha,Ishan Sanjeev Upadhyay,Radhika Mamidi,Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation, SemEval, 2021
@inproceedings{bib_IIIT_2021, AUTHOR = {Raha, Tathagata and Upadhyay, Ishan Sanjeev and Mamidi, Radhika and Kalidindi, Vasudeva Varma}, TITLE = {IIITH at SemEval-2021 Task 7: Leveraging transformer-based humourous and offensive text detection architectures using lexical and hurtlex features and task adaptive pretraining}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2021}}
This paper describes our approach (IIITH) for SemEval-2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense. Our results focus on two major objectives: (i) the effect of task adaptive pretraining on the performance of transformer-based models, and (ii) how lexical and hurtlex features help in quantifying humour and offense. In this paper, we provide a detailed description of our approach along with the comparisons mentioned above.
How do different factors Impact the Inter-language Similarity? A Case Study on Indian languages
Sourav Kumar,Salil Aggarwal,Dipti Mishra Sharma,Radhika Mamidi
Association for Computational Linguistics and the 11th International Joint Conference on Natural Lan, ACL -IJCNLP SRW, 2021
@inproceedings{bib_How__2021, AUTHOR = {Kumar, Sourav and Aggarwal, Salil and Sharma, Dipti Mishra and Mamidi, Radhika}, TITLE = {How do different factors Impact the Inter-language Similarity? A Case Study on Indian languages}, BOOKTITLE = {Association for Computational Linguistics and the 11th International Joint Conference on Natural Lan}, YEAR = {2021}}
India is one of the most linguistically diverse nations of the world and is culturally very rich. Most of these languages are somewhat similar to each other on account of sharing a common ancestry or being in contact for a long period of time (Bhattacharyya et al., 2016). Nowadays, researchers are constantly putting effort into utilizing language relatedness to improve the performance of various NLP systems such as cross-lingual semantic search, machine translation (Kunchukuttan and Bhattacharyya, 2020), sentiment analysis systems, etc. So in this paper, we performed an extensive case study on similarity involving languages of the Indian subcontinent. Language similarity prediction is defined as the task of measuring how similar two languages are on the basis of their lexical, morphological and syntactic features. In this study, we concentrate only on the approach to calculate lexical similarity between Indian languages by looking at various factors such as size and type of corpus, similarity algorithms, subword segmentation, etc. The main takeaways from our work are: (i) the relative order of the language similarities largely remains the same, regardless of the factors mentioned above, (ii) similarity within the same language family is higher, and (iii) languages share more lexical features at the subword level.
Efficient Multilingual Text Classification for Indian languages
Salil Aggarwal,Sourav Kumar,Radhika Mamidi
Recent Advances in Natural Language Processing, RANLP, 2021
@inproceedings{bib_Effi_2021, AUTHOR = {Aggarwal, Salil and Kumar, Sourav and Mamidi, Radhika}, TITLE = {Efficient Multilingual Text Classification for Indian languages}, BOOKTITLE = {Recent Advances in Natural Language Processing}, YEAR = {2021}}
India is one of the richest language hubs on the earth and is very diverse and multilingual. But apart from a few Indian languages, most of them are still considered to be resource poor. Since most of the NLP techniques either require linguistic knowledge that can only be developed by experts and native speakers of that language or they require a lot of labelled data which is again expensive to generate, the task of text classification becomes challenging for most of the Indian languages. The main objective of this paper is to see how one can benefit from the lexical similarity found in Indian languages in a multilingual scenario. Can a classification model trained on one Indian language be reused for other Indian languages? So, we performed zero-shot text classification via exploiting lexical similarity and we observed that our model performs best in those cases where the vocabulary overlap between the language datasets is maximum. Our experiments also confirm that a single multilingual model trained via exploiting language relatedness outperforms the baselines by significant margins.
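The vocabulary overlap mentioned in the abstract above can be illustrated with a small sketch. The abstract does not say how overlap is measured, so the Jaccard ratio over whitespace tokens used here is an assumption, and the function name `vocabulary_overlap` and the placeholder corpora are hypothetical.

```python
def vocabulary_overlap(corpus_a, corpus_b):
    """Jaccard overlap between the token vocabularies of two corpora.

    corpus_a, corpus_b: iterables of sentences (strings). Tokens are
    obtained by whitespace splitting, which is a simplification.
    """
    vocab_a = {tok for sent in corpus_a for tok in sent.split()}
    vocab_b = {tok for sent in corpus_b for tok in sent.split()}
    union = vocab_a | vocab_b
    if not union:
        # Two empty corpora share nothing measurable.
        return 0.0
    return len(vocab_a & vocab_b) / len(union)


# Placeholder token sequences standing in for two language datasets.
overlap = vocabulary_overlap(["a b c"], ["b c d"])  # 2 shared / 4 total
```

A score like this could be computed for each source-target language pair before transfer, to check whether higher overlap coincides with better zero-shot accuracy.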
Political Discourse Analysis: A Case Study of Code Mixing and Code Switching in Political Speeches
Dama Sravani,V A Lalitha Kameswari,Radhika Mamidi
Computational Approaches to Linguistic Code-Switching, CALCS, 2021
@inproceedings{bib_Poli_2021, AUTHOR = {Sravani, Dama and Kameswari, V A Lalitha and Mamidi, Radhika}, TITLE = {Political Discourse Analysis: A Case Study of Code Mixing and Code Switching in Political Speeches}, BOOKTITLE = {Computational Approaches to Linguistic Code-Switching}, YEAR = {2021}}
Political discourse is one of the most interesting data to study power relations in the framework of Critical Discourse Analysis. With the increase in the modes of textual and spoken forms of communication, politicians use language and linguistic mechanisms that contribute significantly in building their relationship with people, especially in a multilingual country like India with many political parties with different ideologies. This paper analyses code-mixing and code-switching in Telugu political speeches to determine the factors responsible for their usage levels in various social settings and communicative contexts. We also compile a detailed set of rules capturing dialectal variations between Standard and Telangana dialects of Telugu.
Clickbait Detection in Telugu: Overcoming NLP Challenges in Resource-Poor Languages using Benchmarked Techniques
Mounika Marreddy,Oota Subba Reddy,Vakada Lakshmi Sireesha,Chinni Venkata Charan,Radhika Mamidi
International Joint Conference on Neural Networks, IJCNN, 2021
@inproceedings{bib_Clic_2021, AUTHOR = {Marreddy, Mounika and Reddy, Oota Subba and Sireesha, Vakada Lakshmi and Charan, Chinni Venkata and Mamidi, Radhika}, TITLE = {Clickbait Detection in Telugu: Overcoming NLP Challenges in Resource-Poor Languages using Benchmarked Techniques}, BOOKTITLE = {International Joint Conference on Neural Networks}, YEAR = {2021}}
Clickbait headlines have become a nudge in social media and news websites. The methods to identify clickbait are largely being developed for English. There is a need for the same in other languages as well, with the increase in the usage of social media platforms in different languages. In this work, we present an annotated clickbait dataset of 112,657 headlines that can be used for building an automated clickbait detection system for Telugu, a resource-poor language. Our contribution in this paper includes (i) generation of the latest pre-trained language models, including RoBERTa, ALBERT, and ELECTRA, trained on a large Telugu corpus of 8,015,588 sentences that we had collected, and (ii) data analysis and benchmarking of the performance of different approaches ranging from hand-crafted features to state-of-the-art models. We show that the pre-trained language models trained on Telugu outperform the existing pre-trained models viz. BERT-Multilingual-Cased [1], XLM-MLM [2], and XLM-R [3] on the clickbait task. On a large Telugu clickbait dataset of 112,657 samples, the Light Gradient Boosted Machines (LGBM) model achieves an F1-score of 0.94 for clickbait headlines. For non-clickbait headlines, an F1-score of 0.93 is obtained, which is similar to that of the clickbait class. We open-source our dataset, pre-trained models, and code.
Towards Quantifying Magnitude of Political Bias in News Articles Using a Novel Annotation Schema
V A Lalitha Kameswari,Radhika Mamidi
Recent Advances in Natural Language Processing, RANLP, 2021
@inproceedings{bib_Towa_2021, AUTHOR = {Kameswari, V A Lalitha and Mamidi, Radhika}, TITLE = {Towards Quantifying Magnitude of Political Bias in News Articles Using a Novel Annotation Schema}, BOOKTITLE = {Recent Advances in Natural Language Processing}, YEAR = {2021}}
Media bias is a predominant phenomenon present in most forms of print and electronic media such as news articles, blogs, tweets, etc. Since media plays a pivotal role in shaping public opinion towards political happenings, both political parties and media houses often use such sources as outlets to propagate their own prejudices to the public. There has been some research on detecting political bias in news articles. However, none of it attempts to analyse the nature of bias or quantify the magnitude of the bias in a given text. This paper presents a political bias annotated corpus viz. PoBiCo-21, which is annotated using a schema specifically designed with 10 labels to capture various techniques used to create political bias in news. We create a ranking of these techniques based on their contribution to bias. After validating the ranking, we propose methods to use it to quantify the magnitude of bias in political news articles.
Jibes & Delights: A Dataset of Targeted Insults and Compliments to Tackle Online Abuse
Ravsimar Singh Sodhi,Kartikey Pant,Radhika Mamidi
International Joint Conference on Natural Language Processing Workshop, IJCNLP-W, 2021
@inproceedings{bib_Jibe_2021, AUTHOR = {Sodhi, Ravsimar Singh and Pant, Kartikey and Mamidi, Radhika}, TITLE = {Jibes \& Delights: A Dataset of Targeted Insults and Compliments to Tackle Online Abuse}, BOOKTITLE = {International Joint Conference on Natural Language Processing Workshop}, YEAR = {2021}}
Online abuse and offensive language on social media have become widespread problems in today’s digital age. In this paper, we contribute a Reddit-based dataset, consisting of 68,159 insults and 51,102 compliments targeted at individuals instead of targeting a particular community or race. Secondly, we benchmark multiple existing state-of-the-art models for both classification and unsupervised style transfer on the dataset. Finally, we analyse the experimental results and conclude that the transfer task is challenging, requiring the models to understand the high degree of creativity exhibited in the data.
HASOCOne@ FIRE-HASOC2020: Using BERT and Multilingual BERT models for Hate Speech Detection
Dowlagar suman,Radhika Mamidi
Hate Speech and Offensive Content Identification in Indo-European Languages, HASOC, 2021
@inproceedings{bib_HASO_2021, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika }, TITLE = {HASOCOne@ FIRE-HASOC2020: Using BERT and Multilingual BERT models for Hate Speech Detection}, BOOKTITLE = {Hate Speech and Offensive Content Identification in Indo-European Languages}, YEAR = {2021}}
Hateful and toxic content has become a significant concern in today's world due to an exponential rise in social media. The increase in hate speech and harmful content has motivated researchers to dedicate substantial efforts to the challenging direction of hateful content identification. In this task, we propose an approach to automatically classify hate speech and offensive content. We have used the datasets obtained from FIRE 2019 and 2020 shared tasks. We perform experiments by taking advantage of transfer learning models. We observed that the pre-trained BERT model and the multilingual-BERT model gave the best results. The code is made publicly available at https://github.com/suman101112/hasoc-fire-2020.
Does a Hybrid Neural Network based Feature Selection Model Improve Text Classification?
Dowlagar suman,Radhika Mamidi
Technical Report, arXiv, 2021
@inproceedings{bib_Does_2021, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika }, TITLE = {Does a Hybrid Neural Network based Feature Selection Model Improve Text Classification?}, BOOKTITLE = {Technical Report}, YEAR = {2021}}
Text classification is a fundamental problem in the field of natural language processing. Text classification mainly focuses on giving more importance to all the relevant features that help classify the textual data. Apart from these, the text can have redundant or highly correlated features. These features increase the complexity of the classification algorithm. Thus, many dimensionality reduction methods have been proposed for use with traditional machine learning classifiers, and this combination has achieved good results. In this paper, we propose a hybrid feature selection method for obtaining relevant features by combining various filter-based feature selection methods and a fastText classifier. We then present three ways of implementing a feature selection and neural network pipeline. We observed a reduction in training time when feature selection methods are used along with neural networks. We also observed a slight increase in accuracy on some datasets.
Multilingual Pre-Trained Transformers and Convolutional NN Classification Models for Technical Domain Identification
Dowlagar suman,Radhika Mamidi
International Conference on Natural Language Processing, ICNLP, 2021
@inproceedings{bib_Mult_2021, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika }, TITLE = {Multilingual Pre-Trained Transformers and Convolutional NN Classification Models for Technical Domain Identification}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2021}}
In this paper, we present a transfer learning system to perform technical domain identification on multilingual text data. We submitted two runs: one uses the transformer model BERT, and the other uses XLM-RoBERTa with a CNN model for text classification. These models allowed us to identify the domain of the given sentences for the ICON 2020 shared task, TechDOfication: Technical Domain Identification. Our system ranked best on subtasks 1d and 1g of the TechDOfication dataset.
Unsupervised Technical Domain Terms Extraction using Term Extractor
Dowlagar suman,Radhika Mamidi
International Conference on Natural Language Processing, ICNLP, 2021
@inproceedings{bib_Unsu_2021, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika }, TITLE = {Unsupervised Technical Domain Terms Extraction using Term Extractor}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2021}}
Terminology extraction, also known as term extraction, is a subtask of information extraction. The goal of terminology extraction is to extract relevant words or phrases from a given corpus automatically. This paper focuses on the unsupervised automated domain term extraction method that considers chunking, preprocessing, and ranking domain-specific terms using relevance and cohesion functions for ICON 2020 shared task 2: TermTraction.
Cmsaone@ dravidian-codemix-fire2020: A meta embedding and transformer model for code-mixed sentiment analysis on social media text
Dowlagar suman,Radhika Mamidi
Technical Report, arXiv, 2021
@inproceedings{bib_Cmsa_2021, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika }, TITLE = {Cmsaone@ dravidian-codemix-fire2020: A meta embedding and transformer model for code-mixed sentiment analysis on social media text}, BOOKTITLE = {Technical Report}, YEAR = {2021}}
Code-mixing (CM) is a frequently observed phenomenon in which multiple languages are used in an utterance or sentence. CM is mostly practiced on various social media platforms and in informal conversations. Sentiment analysis (SA) is a fundamental step in NLP and is well studied for monolingual text. Code-mixing adds a challenge to sentiment analysis due to its non-standard representations. This paper proposes a meta embedding with a transformer method for sentiment analysis on the Dravidian code-mixed dataset. In our method, we used meta embeddings to capture rich text representations. We used the proposed method for the Task: "Sentiment Analysis for Dravidian Languages in Code-Mixed Text", and it achieved an F1 score of and for the given Dravidian code mixed data sets. The code is available on GitHub at https://github.com/suman101112/fire-2020-Dravidian-CodeMix.
Analyzing Curriculum Learning for Sentiment Analysis along Task Difficulty, Pacing and Visualization Axes
VIJJINI ANVESH RAO,KAVERI ANURANJANA,Radhika Mamidi
Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA, 2021
@inproceedings{bib_Anal_2021, AUTHOR = {RAO, VIJJINI ANVESH and ANURANJANA, KAVERI and Mamidi, Radhika }, TITLE = {Analyzing Curriculum Learning for Sentiment Analysis along Task Difficulty, Pacing and Visualization Axes}, BOOKTITLE = {Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis}, YEAR = {2021}}
While Curriculum Learning (CL) has recently gained traction in natural language processing tasks, it is still not adequately analyzed. Previous works only show its effectiveness but fall short of fully explaining and interpreting its internal workings. In this paper, we analyze curriculum learning in sentiment analysis along multiple axes. Some of these axes have been proposed by earlier works that need more in-depth study. Such analysis requires understanding where curriculum learning works and where it does not. Our axes of analysis include task difficulty on CL, comparing CL pacing techniques, and qualitative analysis by visualizing the movement of attention scores in the model as curriculum phases progress. We find that curriculum learning works best for difficult tasks and may even lead to a decrement in performance for tasks with higher performance without curriculum learning. We see that One-Pass curriculum strategies suffer from catastrophic forgetting and attention movement visualization within curriculum pacing. This shows that curriculum learning breaks down the challenging main task into easier sub-tasks solved sequentially.
Hopeful Men@LT-EDI-EACL2021: Hope Speech Detection Using Indic Transliteration and Transformers
Ishan Sanjeev Upadhyay,E Nikhil,Anshul Wadhawan,Radhika Mamidi
Conference of the European Chapter of the Association for Computational Linguistics (EACL), EACL, 2021
@inproceedings{bib_Hope_2021, AUTHOR = {Upadhyay, Ishan Sanjeev and Nikhil, E and Wadhawan, Anshul and Mamidi, Radhika }, TITLE = {Hopeful Men@LT-EDI-EACL2021: Hope Speech Detection Using Indic Transliteration and Transformers}, BOOKTITLE = {Conference of the European Chapter of the Association for Computational Linguistics (EACL)}, YEAR = {2021}}
This paper describes the approach we used to detect hope speech in the HopeEDI dataset. We experimented with two approaches. In the first approach, we used contextual embeddings to train classifiers using logistic regression, random forest, SVM, and LSTM based models. The second approach involved using a majority voting ensemble of 11 models which were obtained by fine-tuning pre-trained transformer models (BERT, ALBERT, RoBERTa, IndicBERT) after adding an output layer. We found that the second approach was superior for English, Tamil and Malayalam. Our solution got a weighted F1 score of 0.93, 0.75 and 0.49 for English, Malayalam and Tamil respectively. Our solution ranked first in English, eighth in Malayalam and eleventh in Tamil.
Multichannel LSTM-CNN for Telugu technical domain identification
Gundapu Sunil,Radhika Mamidi
International Conference on Natural Language Processing, ICNLP, 2021
@inproceedings{bib_Mult_2021, AUTHOR = {Sunil, Gundapu and Mamidi, Radhika }, TITLE = {Multichannel LSTM-CNN for Telugu technical domain identification}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2021}}
With the instantaneous growth of text information, retrieving domain-oriented information from text data has a broad range of applications in Information Retrieval and Natural Language Processing. Thematic keywords give a compressed representation of the text. Usually, Domain Identification plays a significant role in Machine Translation, Text Summarization, Question Answering, Information Extraction, and Sentiment Analysis. In this paper, we propose the Multichannel LSTM-CNN methodology for Technical Domain Identification for Telugu. This architecture was used and evaluated in the context of the ICON shared task TechDOfication 2020 (task h), and our system achieved an F1 score of 69.9% on the test dataset and 90.01% on the validation set.
Towards Conversational Humor Analysis and Design
Tanishq Chaudhary,Mayank Goel,Radhika Mamidi
Humor Research Conference, HRC, 2021
@inproceedings{bib_Towa_2021, AUTHOR = {Chaudhary, Tanishq and Goel, Mayank and Mamidi, Radhika }, TITLE = {Towards Conversational Humor Analysis and Design}, BOOKTITLE = {Humor Research Conference}, YEAR = {2021}}
Well-defined jokes can be divided neatly into a setup and a punchline. While most works on humor today talk about a joke as a whole, the idea of generating punchlines to a setup has applications in conversational humor, where funny remarks usually occur with a non-funny context. Thus, this paper is based around two core concepts: Classification and the Generation of a punchline from a particular setup based on the Incongruity Theory. We first implement a feature-based machine learning model to classify humor. For humor generation, we use a neural model, and then merge the classical rule-based approaches with the neural approach to create a hybrid model. The idea is to combine insights gained from other tasks with the setup-punchline model and apply them to existing text generation approaches. We then compare our model's output with human-written jokes with the help of human evaluators in a double-blind study.
Pre-trained Transformers with Convolutional Neural Networks for Hope Speech Detection.
Dowlagar suman,Radhika Mamidi
Conference of the European Chapter of the Association for Computational Linguistics (EACL), EACL, 2021
@inproceedings{bib_Pre-_2021, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika }, TITLE = {Pre-trained Transformers with Convolutional Neural Networks for Hope Speech Detection.}, BOOKTITLE = {Conference of the European Chapter of the Association for Computational Linguistics (EACL)}, YEAR = {2021}}
Hope is an essential aspect of mental health stability and recovery in every individual in this fast-changing world. Any tools and methods developed for detection, analysis, and generation of hope speech will be beneficial. In this paper, we propose a model on hope-speech detection to automatically detect web content that may play a positive role in diffusing hostility on social media. We perform the experiments by taking advantage of pre-processing and transfer-learning models. We observed that the pre-trained multilingual-BERT model with convolutional neural networks gave the best results. Our model ranked first, third, and fourth on the English, Malayalam-English, and Tamil-English code-mixed datasets, respectively.
One World, One Family: Hope Speech Detection with BERT Transformer Model
Gundapu Sunil,Radhika Mamidi
Conference of the European Chapter of the Association for Computational Linguistics (EACL), EACL, 2021
@inproceedings{bib_One__2021, AUTHOR = {Sunil, Gundapu and Mamidi, Radhika }, TITLE = {One World, One Family: Hope Speech Detection with BERT Transformer Model}, BOOKTITLE = {Conference of the European Chapter of the Association for Computational Linguistics (EACL)}, YEAR = {2021}}
The rapid rise of online social networks such as YouTube, Facebook, and Twitter allows people to express their views more widely online. However, at the same time, it can lead to an increase in conflict and hatred among consumers in the name of freedom of speech. Therefore, it is essential to adopt a positive reinforcement approach and to study encouraging, positive, helpful, and supportive social media content. In this paper, we describe a Transformer-based BERT model for hope speech detection for equality, diversity, and inclusion, submitted for LT-EDI-2021 Task 2. Our model achieves a weighted average F1-score of 0.93 on the test set.
Transformers with the Class Balanced Loss for Offensive Language Identification in Dravidian Code-Mixed text.
Dowlagar suman,Radhika Mamidi
Conference of the European Chapter of the Association for Computational Linguistics (EACL), EACL, 2021
@inproceedings{bib_Tran_2021, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika }, TITLE = {Transformers with the Class Balanced Loss for Offensive Language Identification in Dravidian Code-Mixed text.}, BOOKTITLE = {Conference of the European Chapter of the Association for Computational Linguistics (EACL)}, YEAR = {2021}}
The intensity of online abuse has increased in recent years. Automated tools are being developed to prevent the use of hate speech and offensive content. Most of the technologies use natural language and machine learning tools to identify offensive text. In a multilingual society, where code-mixing is a norm, hate content is delivered in a code-mixed form on social media, which makes offensive content identification even more challenging. In this work, we participated in the EACL task to detect offensive content in the code-mixed social media scenario. The methodology uses a transformer model with transliteration and a class-balanced loss for offensive content identification. In this task, our model ranked 2nd in Malayalam-English and 4th in Tamil-English code-mixed languages.
Graph convolutional networks with multi-headed attention for code-mixed sentiment analysis
Dowlagar suman,Radhika Mamidi
Workshop on Spoken Language Technologies for Under-Resourced Languages, SLTU-W, 2021
@inproceedings{bib_Grap_2021, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika }, TITLE = {Graph convolutional networks with multi-headed attention for code-mixed sentiment analysis}, BOOKTITLE = {Workshop on Spoken Language Technologies for Under-Resourced Languages}, YEAR = {2021}}
Code-mixing is a frequently observed phenomenon in multilingual communities where a speaker uses multiple languages in an utterance or sentence. Code-mixed texts are abundant, especially in social media, and pose a problem for NLP tools as they are typically trained on monolingual corpora. Recently, finding the sentiment from code-mixed text has been attempted by some researchers in the SentiMix SemEval 2020 and Dravidian-CodeMix FIRE 2020 shared tasks. Most of these attempts use traditional methods, long short-term memory, convolutional neural networks, and transformer models for code-mixed sentiment analysis (CMSA). However, no study has explored graph convolutional neural networks on CMSA. In this paper, we propose graph convolutional networks (GCN) for sentiment analysis on code-mixed text. We have used the datasets from the Dravidian-CodeMix FIRE 2020. Our experimental results on multiple CMSA datasets demonstrate that the GCN with multi-headed attention model shows an improvement in classification metrics.
Detection of Fake Users in SMPs Using NLP and Graph Embeddings
Manojit Chakraborty,Shubham Das,Radhika Mamidi
Technical Report, arXiv, 2021
@inproceedings{bib_Dete_2021, AUTHOR = {Chakraborty, Manojit and Das, Shubham and Mamidi, Radhika }, TITLE = {Detection of Fake Users in SMPs Using NLP and Graph Embeddings}, BOOKTITLE = {Technical Report}, YEAR = {2021}}
Social Media Platforms (SMPs) like Facebook, Twitter, and Instagram have large user bases all around the world that generate huge amounts of data every second. This includes a lot of posts by fake and spam users, typically used by many organisations around the globe to gain a competitive edge over others. In this work, we aim at detecting such user accounts on Twitter using a novel approach. We show how to distinguish between genuine and spam accounts on Twitter using a combination of Graph Representation Learning and Natural Language Processing techniques.
Gated Convolutional Sequence to Sequence Based Learning for English-Hingilsh Code-Switched Machine Translation.
Dowlagar suman,Radhika Mamidi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2021
@inproceedings{bib_Gate_2021, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika }, TITLE = {Gated Convolutional Sequence to Sequence Based Learning for English-Hingilsh Code-Switched Machine Translation.}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}, YEAR = {2021}}
Code-Switching is the embedding of linguistic units or phrases from two or more languages in a single sentence. This phenomenon is practiced in all multilingual communities and is prominent in social media. Consequently, there is a growing need to understand code-switched translations by translating the code-switched text into one of the standard languages or vice versa. Neural machine translation is a well-studied research problem for monolingual text. In this paper, we have used gated convolutional sequence-to-sequence networks for English-Hinglish translation. The convolutions in the model help to identify the compositional structure in the sequences more easily. The model relies on gating and performs multiple attention steps at encoder and decoder layers.
ViTA: Visual-Linguistic Translation by Aligning Object Tags
Kshitij Gupta,Devansh Gautam,Radhika Mamidi
Workshop on Asian Translation, WAT, 2021
@inproceedings{bib_ViTA_2021, AUTHOR = {Gupta, Kshitij and Gautam, Devansh and Mamidi, Radhika }, TITLE = {ViTA: Visual-Linguistic Translation by Aligning Object Tags}, BOOKTITLE = {Workshop on Asian Translation}, YEAR = {2021}}
Multimodal Machine Translation (MMT) enriches the source text with visual information for translation. It has gained popularity in recent years, and several pipelines have been proposed in the same direction. Yet, the task lacks quality datasets to illustrate the contribution of visual modality in the translation systems. In this paper, we propose our system for the Multimodal Translation Task of WAT 2021 from English to Hindi. We propose to use mBART, a pretrained multilingual sequence-to-sequence model, for the textual-only translations. Further, we bring the visual information to a textual domain by extracting object tags from the image and enhance the input for the multimodal task. We also explore the robustness of our system by systematically degrading the source text. Finally, we achieve a BLEU score of 44.6 and 51.6 on the test set and challenge set of the task.
Volta at SemEval-2021 Task 6: Towards Detecting Persuasive Texts and Images using Textual and Multimodal Ensemble
Kshitij Gupta,Devansh Gautam,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2021
@inproceedings{bib_Volt_2021, AUTHOR = {Gupta, Kshitij and Gautam, Devansh and Mamidi, Radhika }, TITLE = {Volta at SemEval-2021 Task 6: Towards Detecting Persuasive Texts and Images using Textual and Multimodal Ensemble}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2021}}
Memes are one of the most popular types of content used to spread information online. They can influence a large number of people through rhetorical and psychological techniques. The task, Detection of Persuasion Techniques in Texts and Images, is to detect these persuasive techniques in memes. It consists of three subtasks: (A) Multi-label classification using textual content, (B) Multi-label classification and span identification using textual content, and (C) Multi-label classification using visual and textual content. In this paper, we propose a transfer learning approach to fine-tune BERT-based models in different modalities. We also explore the effectiveness of ensembles of models trained in different modalities. We achieve an F1-score of 57.0, 48.2, and 52.1 in the corresponding subtasks.
SentiInc: Incorporating Sentiment Information into Sentiment Transfer Without Parallel Data
Kartikey Pant,Yash Verma,Radhika Mamidi
European Conference on Information Retrieval, ECIR, 2020
@inproceedings{bib_Sent_2020, AUTHOR = {Pant, Kartikey and Verma, Yash and Mamidi, Radhika }, TITLE = {SentiInc: Incorporating Sentiment Information into Sentiment Transfer Without Parallel Data}, BOOKTITLE = {European Conference on Information Retrieval}, YEAR = {2020}}
Sentiment-to-sentiment transfer involves changing the sentiment of the given text while preserving the underlying information. In this work, we present a model, SentiInc, for sentiment-to-sentiment transfer using unpaired mono-sentiment data. Existing sentiment-to-sentiment transfer models ignore the valuable sentiment-specific details already present in the text. We address this issue by providing a simple framework for encoding sentiment-specific information in the target sentence while preserving the content information. This is done by incorporating a sentiment-based loss in back-translation-based style transfer. Extensive experiments over the Yelp dataset show that SentiInc outperforms state-of-the-art methods by a margin as large as ∼11% in G-score. The results also demonstrate that our model produces sentiment-accurate and information-preserved sentences.
gundapusunil at SemEval-2020 Task 8: Multimodal Memotion Analysis
Gundapu Sunil,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2020
@inproceedings{bib_gund_2020, AUTHOR = {Sunil, Gundapu and Mamidi, Radhika }, TITLE = {gundapusunil at SemEval-2020 Task 8: Multimodal Memotion Analysis}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2020}}
Recent technological advancements in the Internet and social media usage have resulted in the evolution of faster and more efficient platforms of communication. These platforms include visual, textual and speech mediums and have brought a unique social phenomenon called Internet memes. Internet memes are in the form of images with witty, catchy, or sarcastic text descriptions. In this paper, we present a multi-modal sentiment analysis system using deep neural networks combining Computer Vision and Natural Language Processing. Our aim differs from the usual sentiment analysis goal of predicting whether a text expresses positive or negative sentiment; instead, we aim to classify the Internet meme as positive, negative, or neutral, identify the type of humor expressed, and quantify the extent to which a particular effect is being expressed. Our system has been developed using CNN and LSTM and outperformed the baseline score.
From Humour to Hatred: A Computational Analysis of Off-Colour Humour
VIKRAM AHUJA,Radhika Mamidi,Navjyoti Singh
International Conference on Natural Language Processing and Chinese Computing, NLPCC, 2020
@inproceedings{bib_From_2020, AUTHOR = {AHUJA, VIKRAM and Mamidi, Radhika and Singh, Navjyoti }, TITLE = {From Humour to Hatred: A Computational Analysis of Off-Colour Humour}, BOOKTITLE = {International Conference on Natural Language Processing and Chinese Computing}, YEAR = {2020}}
Off-colour humour is a category of humour which is considered by many to be in poor taste or overly vulgar. Most commonly, off-colour humour contains remarks on a particular ethnic group or gender, violence, domestic abuse, acts concerned with sex, excessive swearing or profanity. Blue humour, dark humour and insult humour are types of off-colour humour. Blue and dark humour, unlike insult humour, are not outrightly insulting in nature but are often misclassified because of the presence of insults and harmful speech. As the primary contributions of this paper, we provide an original dataset consisting of nearly 15,000 instances and a novel approach towards resolving the problem of separating dark and blue humour from offensive humour, which is essential so that free speech on the internet is not curtailed. Our experiments show that deep learning methods outperform n-gram-based approaches like SVMs, Naive Bayes and Logistic Regression by a large margin.
Question and Answer Pair Generation for Telugu Short Stories
Bommadi Meghana,Shreya Reddy Terupally,Radhika Mamidi
International Conference on Natural Language Processing, ICON, 2020
@inproceedings{bib_Ques_2020, AUTHOR = {Meghana, Bommadi and Terupally, Shreya Reddy and Mamidi, Radhika }, TITLE = {Question and Answer Pair Generation for Telugu Short Stories}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2020}}
Question Answer pair generation is a task that has been worked upon by multiple researchers in many languages. It has been a topic of interest due to its extensive uses in different fields like self assessment, academics, business website FAQs etc. Many experiments were conducted on Question Answering pair generation in English, concentrating on basic Wh-questions with a rule-based approach. We have built the first hybrid machine learning and rule-based solution in Telugu which is efficient for short stories or short passages in children’s books. Our work covers the fundamental question forms with the question types: adjective, yes/no, adverb, verb, when, where, whose, quotative, and quantitative (how many/how much). We constructed rules for question generation using POS tags and UD tags along with linguistic information of the surrounding context of the word.
SUKHAN: Corpus of Hindi Shayaris annotated with Sentiment Polarity Information
Salil Aggarwal,Abhigyan Ghosh,Radhika Mamidi
International Conference on Natural Language Processing, ICNLP, 2020
@inproceedings{bib_SUKH_2020, AUTHOR = {Aggarwal, Salil and Ghosh, Abhigyan and Mamidi, Radhika }, TITLE = {SUKHAN: Corpus of Hindi Shayaris annotated with Sentiment Polarity Information}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2020}}
Shayari is a form of poetry mainly popular in the Indian subcontinent, in which the poet expresses his emotions and feelings in a very poetic manner. It is one of the best ways to express our thoughts and opinions. Therefore, it is of prime importance to have an annotated corpus of Hindi shayaris for the task of sentiment analysis. In this paper, we introduce SUKHAN, a dataset consisting of Hindi shayaris along with sentiment polarity labels. To the best of our knowledge, this is the first corpus of Hindi shayaris annotated with sentiment polarity information. This corpus contains a total of 733 Hindi shayaris of various genres. Also, this dataset is of utmost value as all the annotation is done manually by five annotators and this makes it a very rich dataset for training purposes. This annotated corpus is also used to build baseline sentiment classification models using machine learning techniques.
gundapusunil at SemEval-2020 Task 9: Syntactic semantic lstm architecture for sentiment analysis of code-mixed data
Gundapu Sunil,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2020
@inproceedings{bib_gund_2020, AUTHOR = {Sunil, Gundapu and Mamidi, Radhika }, TITLE = {gundapusunil at SemEval-2020 Task 9: Syntactic semantic lstm architecture for sentiment analysis of code-mixed data}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2020}}
The phenomenon of mixing the vocabulary and syntax of multiple languages within the same utterance is called Code-Mixing. This is more evident in multilingual societies. In this paper, we have developed a system for SemEval 2020: Task 9 on Sentiment Analysis for Code-Mixed Social Media Text. Our system first generates two types of embeddings for the social media text. The first is a character-level embedding that encodes character-level information and handles out-of-vocabulary entries; the second is a FastText word embedding that captures morphology and semantics. These two embeddings were passed to the LSTM network, and the system outperformed the baseline model.
Leveraging Multilingual Resources for Language Invariant Sentiment Analysis
ALLEN JOJO ANTONY,ARGHYA BHATTACHARYA,JAIPAL SINGH GOUD,Radhika Mamidi
Conference of the European Association for Machine Translation, EAMT, 2020
@inproceedings{bib_Leve_2020, AUTHOR = {ANTONY, ALLEN JOJO and BHATTACHARYA, ARGHYA and GOUD, JAIPAL SINGH and Mamidi, Radhika }, TITLE = {Leveraging Multilingual Resources for Language Invariant Sentiment Analysis}, BOOKTITLE = {Conference of the European Association for Machine Translation}, YEAR = {2020}}
Sentiment analysis is a widely researched NLP problem with state-of-the-art solutions capable of attaining human-like accuracies for various languages. However, these methods rely heavily on large amounts of labeled data or sentiment weighted language-specific lexical resources that are unavailable for low-resource languages. Our work attempts to tackle this data scarcity issue by introducing a neural architecture for language invariant sentiment analysis capable of leveraging various monolingual datasets for training without any kind of cross-lingual supervision. The proposed architecture attempts to learn language agnostic sentiment features via adversarial training on multiple resource-rich languages which can then be leveraged for inferring sentiment information at a sentence level on a low resource language. Our model outperforms the current state-of-the-art methods on the Multilingual Amazon Review Text Classification dataset and achieves significant performance gains over prior work on the low resource Sentiraama corpus. A detailed analysis of our research highlights the ability of our architecture to perform significantly well in the presence of minimal amounts of training data for low resource languages.
A Novel Annotation Schema for Conversational Humor: Capturing the Cultural Nuances in Kanyasulkam
Vaishnavi Pamulapati,GAYATRI PURIGILLA,Radhika Mamidi
Linguistic Annotation Workshop, LAW, 2020
@inproceedings{bib_A_No_2020, AUTHOR = {Pamulapati, Vaishnavi and PURIGILLA, GAYATRI and Mamidi, Radhika}, TITLE = {A Novel Annotation Schema for Conversational Humor: Capturing the Cultural Nuances in Kanyasulkam}, BOOKTITLE = {Linguistic Annotation Workshop}, YEAR = {2020}}
Humor research is a multifaceted field that has led to a better understanding of humor’s psychological effects and the development of different theories of humor. This paper’s main objective is to develop a hierarchical schema for a fine-grained annotation of Conversational Humor. Based on the Benign Violation Theory, the benignity or non-benignity of the interlocutor’s intentions is included within the framework. Under the categories mentioned above, in addition to different types of humor, the techniques utilized by these types are identified. Furthermore, a prominent play from Telugu, Kanyasulkam, is annotated to substantiate the work across cultures at multiple levels. The inter-annotator agreement is calculated to assess the accuracy and validity of the dataset. An in-depth analysis of the disagreement is performed to understand the subjectivity of humor better.
Conversational implicatures in English dialogue: Annotated dataset
ELIZABETH JASMI GEORGE,Radhika Mamidi
Procedia Computer Science, PCS, 2020
@inproceedings{bib_Conv_2020, AUTHOR = {GEORGE, ELIZABETH JASMI and Mamidi, Radhika}, TITLE = {Conversational implicatures in English dialogue: Annotated dataset}, BOOKTITLE = {Procedia Computer Science}, YEAR = {2020}}
Human dialogue often contains utterances whose meanings are entirely different from the sentences used, yet they are clearly understood by the interlocutors. But in human-computer interactions, the machine fails to understand the implicated meaning unless it is trained with a dataset containing the implicated meaning of an utterance along with the utterance and the context in which it is uttered. In linguistic terms, conversational implicatures are the meanings of the speaker’s utterance that are not part of what is explicitly said. In this paper, we introduce a dataset of dialogue snippets with three constituents, which are the context, the utterance, and the implicated meanings. These implicated meanings are the conversational implicatures. The utterances are collected by transcribing from listening comprehension sections of English tests like TOEFL (Test of English as a Foreign Language) as well as scraping dialogues from movie scripts available on IMSDb (Internet Movie Script Database). The utterances are manually annotated with implicatures
SmokPro: Towards Tobacco Product Identification in Social Media Text
Venkata Himakar Yanamandra,Kartikey Pant,Radhika Mamidi
European Conference on Information Retrieval, ECIR, 2020
@inproceedings{bib_Smok_2020, AUTHOR = {Yanamandra, Venkata Himakar and Pant, Kartikey and Mamidi, Radhika}, TITLE = {SmokPro: Towards Tobacco Product Identification in Social Media Text}, BOOKTITLE = {European Conference on Information Retrieval}, YEAR = {2020}}
In this work, we explore the fine-grained classification of tweets involving tobacco focused on identifying tobacco products. We release the SmokPro dataset, along with an extensible method of labeling the tweets through a comprehensive annotation schema. We then perform benchmarking experiments using state-of-the-art text classification models, exploiting contextual word embeddings and achieve F1 scores as high as 0.971, hence showing the efficacy of the dataset and the suitability of the models for the task.
Towards Detection of Subjective Bias using Contextualized Word Embeddings
Kartikey Pant,Tanvi Dadu,Radhika Mamidi
Technical Report, arXiv, 2020
@inproceedings{bib_Towa_2020, AUTHOR = {Pant, Kartikey and Dadu, Tanvi and Mamidi, Radhika}, TITLE = {Towards Detection of Subjective Bias using Contextualized Word Embeddings}, BOOKTITLE = {Technical Report}, YEAR = {2020}}
Subjective bias detection is critical for applications like propaganda detection, content recommendation, sentiment analysis, and bias neutralization. This bias is introduced in natural language via inflammatory words and phrases, casting doubt over facts, and presupposing the truth. In this work, we perform comprehensive experiments for detecting subjective bias using BERT-based models on the Wiki Neutrality Corpus (WNC). The dataset consists of 360k labeled instances, from Wikipedia edits that remove various instances of the bias. We further propose BERT-based ensembles that outperform state-of-the-art methods like BERT-large by a margin of 5.6 F1 score.
Towards Detection of Subjective Bias using Contextualized Word Embeddings
Kartikey Pant,Tanvi Dadu,Radhika Mamidi
International Conference on World wide web, WWW, 2020
@inproceedings{bib_Towa_2020, AUTHOR = {Pant, Kartikey and Dadu, Tanvi and Mamidi, Radhika}, TITLE = {Towards Detection of Subjective Bias using Contextualized Word Embeddings}, BOOKTITLE = {International Conference on World wide web}, YEAR = {2020}}
Subjective bias detection is critical for applications like propaganda detection, content recommendation, sentiment analysis, and bias neutralization. This bias is introduced in natural language via inflammatory words and phrases, casting doubt over facts, and presupposing the truth. In this work, we perform comprehensive experiments for detecting subjective bias using BERT-based models on the Wiki Neutrality Corpus (WNC). The dataset consists of 360k labeled instances, from Wikipedia edits that remove various instances of the bias. We further propose BERT-based ensembles that outperform state-of-the-art methods like BERT-large by a margin of 5.6 F1 score.
Dataset Creation and Evaluation of Aspect Based Sentiment Analysis in Telugu, a Low Resource Language
Regatte Yashwanth Reddy,Gangula Rama Rohit Reddy,Radhika Mamidi
International Conference on Language Resources and Evaluation, LREC, 2020
@inproceedings{bib_Data_2020, AUTHOR = {Reddy, Regatte Yashwanth and Reddy, Gangula Rama Rohit and Mamidi, Radhika}, TITLE = {Dataset Creation and Evaluation of Aspect Based Sentiment Analysis in Telugu, a Low Resource Language}, BOOKTITLE = {International Conference on Language Resources and Evaluation}, YEAR = {2020}}
In recent years, sentiment analysis has gained popularity as it is essential to moderate and analyse the information across the internet. It has various applications like opinion mining, social media monitoring, and market research. Aspect Based Sentiment Analysis (ABSA) is an area of sentiment analysis which deals with sentiment at a finer level. ABSA classifies sentiment with respect to each aspect to gain greater insights into the sentiment expressed. Significant contributions have been made in ABSA, but this progress is limited only to a few languages with adequate resources. Telugu lags behind in this area of research despite being one of the most spoken languages in India and an enormous amount of data being created each day. In this paper, we create a reliable resource for aspect based sentiment analysis in Telugu. The data is annotated for three tasks namely Aspect Term Extraction, Aspect Polarity Classification and Aspect Categorisation. Further, we develop baselines for the tasks using deep learning methods demonstrating the reliability and usefulness of the resource.
Annotated Corpus for Sentiment Analysis in Odia Language
GAURAV MOHANTY,PRUTHWIK MISHRA,Radhika Mamidi
International Conference on Language Resources and Evaluation, LREC, 2020
@inproceedings{bib_Anno_2020, AUTHOR = {MOHANTY, GAURAV and MISHRA, PRUTHWIK and Mamidi, Radhika}, TITLE = {Annotated Corpus for Sentiment Analysis in Odia Language}, BOOKTITLE = {International Conference on Language Resources and Evaluation}, YEAR = {2020}}
Given the lack of an annotated corpus of non-traditional Odia literature which serves as the standard when it comes to sentiment analysis, we have created an annotated corpus of Odia sentences and made it publicly available to promote research in the field. Secondly, in order to test the usability of the currently available Odia sentiment lexicon, we experimented with various classifiers by training and testing on the sentiment annotated corpus while using identified affective words from the same as features. Annotation and classification are done at sentence level as the usage of a sentiment lexicon is best suited to sentiment analysis at this level. The created corpus contains 2045 Odia sentences from the news domain annotated with sentiment labels using a well-defined annotation scheme. An inter-annotator agreement score of 0.79 is reported for the corpus.
Manovaad: A Novel Approach to Event Oriented Corpus Creation Capturing Subjectivity and Focus
V A Lalitha Kameswari,Radhika Mamidi
International Conference on Language Resources and Evaluation, LREC, 2020
@inproceedings{bib_Mano_2020, AUTHOR = {Kameswari, V A Lalitha and Mamidi, Radhika}, TITLE = {Manovaad: A Novel Approach to Event Oriented Corpus Creation Capturing Subjectivity and Focus}, BOOKTITLE = {International Conference on Language Resources and Evaluation}, YEAR = {2020}}
In today’s era of globalisation, the increased outreach for every event across the world has been leading to conflicting opinions, arguments and disagreements, often reflected in print media and online social platforms. It is necessary to distinguish factual observations from personal judgements in news, as subjectivity in reporting can influence the audience’s perception of reality. Several studies conducted on the different styles of reporting in journalism are essential in understanding phenomena such as media bias and multiple interpretations of the same event. This domain finds applications in fields such as Media Studies, Discourse Analysis, Information Extraction, Sentiment Analysis, and Opinion Mining. We present an event corpus “Manovaad-v1.0” consisting of 1035 news articles corresponding to 65 events from 3 levels of newspapers viz., Local, National, and International levels. Using this novel format, we correlate the trends in the degree of subjectivity with the geographical closeness of reporting using a Bi-RNN model. We also analyse the role of background and focus in event reporting and capture the focus shift patterns within a global discourse structure for an event. We do this across different levels of reporting and compare the results with the existing work on discourse processing.
A SentiWordNet Strategy for Curriculum Learning in Sentiment Analysis
V ANVESH RAO,KAVERI ANURANJANA,Radhika Mamidi
International Conference on Applications of Natural Language to Information Systems, NLDB, 2020
@inproceedings{bib_A_Se_2020, AUTHOR = {RAO, V ANVESH and ANURANJANA, KAVERI and Mamidi, Radhika}, TITLE = {A SentiWordNet Strategy for Curriculum Learning in Sentiment Analysis}, BOOKTITLE = {International Conference on Applications of Natural Language to Information Systems}, YEAR = {2020}}
Curriculum Learning (CL) is the idea that learning on a training set sequenced or ordered in a manner where samples range from easy to difficult, results in an increment in performance over otherwise random ordering. The idea parallels cognitive science’s theory of how human brains learn, and that learning a difficult task can be made easier by phrasing it as a sequence of easy to difficult tasks. This idea has gained a lot of traction in machine learning and image processing for a while and recently in Natural Language Processing (NLP). In this paper, we apply the ideas of curriculum learning, driven by SentiWordNet in a sentiment analysis setting. In this setting, given a text segment, our aim is to extract its sentiment or polarity. SentiWordNet is a lexical resource with sentiment polarity annotations. By comparing performance with other curriculum strategies and with no curriculum, the effectiveness of the proposed strategy is presented. Convolutional, Recurrence, and Attention-based architectures are employed to assess this improvement. The models are evaluated on a standard sentiment dataset, Stanford Sentiment Treebank.
BERT-based Ensembles for Modeling Disclosure and Support in Conversational Social Media Text
Tanvi Dadu,Kartikey Pant,Radhika Mamidi
CEUR Workshop Proceedings, CEUR, 2020
@inproceedings{bib_BERT_2020, AUTHOR = {Dadu, Tanvi and Pant, Kartikey and Mamidi, Radhika}, TITLE = {BERT-based Ensembles for Modeling Disclosure and Support in Conversational Social Media Text}, BOOKTITLE = {CEUR Workshop Proceedings}, YEAR = {2020}}
There is a growing interest in understanding how humans initiate and hold conversations. The affective understanding of conversations focuses on the problem of how speakers use emotions to react to a situation and to each other. In the CL-Aff Shared Task, the organizers released Get it #OffMyChest dataset, which contains Reddit comments from casual and confessional conversations, labeled for their disclosure and supportiveness characteristics. In this paper, we introduce a predictive ensemble model exploiting the finetuned contextualized word embeddings, RoBERTa and ALBERT. We show that our model outperforms the base models in all considered metrics, achieving an improvement of 3% in the F1 score. We further conduct statistical analysis and outline deeper insights into the given dataset while providing a new characterization of impact for the dataset.
Enhancing Bias Detection in Political News Using Pragmatic Presupposition
V A Lalitha Kameswari,Dama Sravani,Radhika Mamidi
International Joint Conference on Natural Language Processing Workshop, IJCNLP-W, 2020
@inproceedings{bib_Enha_2020, AUTHOR = {Kameswari, V A Lalitha and Sravani, Dama and Mamidi, Radhika}, TITLE = {Enhancing Bias Detection in Political News Using Pragmatic Presupposition}, BOOKTITLE = {International Joint Conference on Natural Language Processing Workshop}, YEAR = {2020}}
Usage of presuppositions in social media and news discourse can be a powerful way to influence the readers as they usually tend to not examine the truth value of the hidden or indirectly expressed information. Fairclough and Wodak (1997) discuss presupposition at a discourse level where some implicit claims are taken for granted in the explicit meaning of a text or utterance. From the Gricean perspective, the presuppositions of a sentence determine the class of contexts in which the sentence could be felicitously uttered. This paper aims to correlate the type of knowledge presupposed in a news article to the bias present in it. We propose a set of guidelines to identify various kinds of presuppositions in news articles and present a dataset consisting of 1050 articles which are annotated for bias (positive, negative or neutral) and the magnitude of presupposition. We introduce a supervised classification approach for detecting bias in political news which significantly outperforms the existing systems.
Detecting Sarcasm in Conversation Context Using Transformer-Based Models
ADITHYA AVVARU,Sanath Vobilisetty,Radhika Mamidi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2020
@inproceedings{bib_Dete_2020, AUTHOR = {AVVARU, ADITHYA and Vobilisetty, Sanath and Mamidi, Radhika}, TITLE = {Detecting Sarcasm in Conversation Context Using Transformer-Based Models}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}, YEAR = {2020}}
Sarcasm detection, regarded as one of the subproblems of sentiment analysis, is a very typical task because the introduction of sarcastic words can flip the sentiment of the sentence itself. To date, many research works revolve around detecting sarcasm in one single sentence and there is very limited research to detect sarcasm resulting from multiple sentences. Current models used Long Short Term Memory (Hochreiter and Schmidhuber, 1997) (LSTM) variants with or without attention to detect sarcasm in conversations. We showed that the models using state-of-the-art Bidirectional Encoder Representations from Transformers (Devlin et al., 2018) (BERT), to capture syntactic and semantic information across conversation sentences, performed better than the current models. Based on the data analysis, we estimated the number of sentences in the conversation that can contribute to the sarcasm, and the results agree with this estimation. We also perform a comparative study of our different versions of the BERT-based model with other variants of the LSTM model and XLNet (Yang et al., 2019) (both using the estimated number of conversation sentences) and find that BERT-based models outperformed them.
Samajh-Boojh: A Reading Comprehension system in Hindi
Shalaka Vaidya,Hiranmai Sri Adibhatla,Radhika Mamidi
International Conference on Natural Language Processing, ICON, 2019
@inproceedings{bib_Sama_2019, AUTHOR = {Vaidya, Shalaka and Adibhatla, Hiranmai Sri and Mamidi, Radhika}, TITLE = {Samajh-Boojh: A Reading Comprehension system in Hindi}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2019}}
This paper presents a novel approach designed to answer questions on a reading comprehension passage. It is an end-to-end system which first focuses on comprehending the given passage, wherein it converts the unstructured passage into structured data, and later proceeds to answer the questions related to the passage using solely the aforementioned structured data. To the best of our knowledge, the proposed design is the first of its kind which accounts for the entire process of comprehending the passage and then answering the questions associated with the passage. The comprehension stage converts the passage into a Discourse Collection that comprises the relations shared amongst logical sentences in the given passage along with the key characteristics of each sentence. This design has applications in the academic domain and in query comprehension in speech systems, among others.
Samvaadhana : A Telugu Dialogue System in Hospital Domain
Suma Reddy Duggenpudi,Kusampudi Siva Subrahamanyam Varma,Radhika Mamidi
Workshop on Deep Learning Approaches for Low-Resource NLP, DeepLo, 2019
@inproceedings{bib_Samv_2019, AUTHOR = {Duggenpudi, Suma Reddy and Varma, Kusampudi Siva Subrahamanyam and Mamidi, Radhika}, TITLE = {Samvaadhana : A Telugu Dialogue System in Hospital Domain}, BOOKTITLE = {Workshop on Deep Learning Approaches for Low-Resource NLP}, YEAR = {2019}}
In this paper, a dialogue system for Hospital domain in Telugu, which is a resource-poor Dravidian language, has been built. It handles various hospital and doctor related queries. The main aim of this paper is to present an approach for modelling a dialogue system in a resource-poor language by combining linguistic and domain knowledge. Focusing on the question answering aspect of the dialogue system, we identified Question Classification and Query Processing as the two most important parts of the dialogue system. Our method combines deep learning techniques for question classification and computational rule-based analysis for query processing. Human evaluation of the system has been performed as there is no automated evaluation tool for dialogue systems in Telugu. Our system achieves a high overall rating along with a significantly accurate context-capturing method as shown in the results.
Question Answering on Structured Data using NLIDB Approach
Vishal Wudaru,Nikhil Koditala,Aruneswara Reddy,Radhika Mamidi
International Conference on Advanced Computing & Communication Systems, ICACCS, 2019
@inproceedings{bib_Ques_2019, AUTHOR = {Wudaru, Vishal and Koditala, Nikhil and Reddy, Aruneswara and Mamidi, Radhika}, TITLE = {Question Answering on Structured Data using NLIDB Approach}, BOOKTITLE = {International Conference on Advanced Computing \& Communication Systems}, YEAR = {2019}}
In this paper we present our work in building a Natural Language Interface to Database (NLIDB) system using an intermediate query approach. This approach is demonstrated using a Movie domain chatbot and can also be extended to different domains. The need for NLIDB systems has increased in this fast-paced world, where more users are accessing databases through their smartphones and web browsers. An NLIDB system maps a user’s natural language query to a database query, allowing the user to extract information without any prior experience with databases. The results obtained are very promising, and the system can tackle most user queries regarding the target database.
Affect in Tweets Using Experts Model
OOTA SUBBA REDDY,ADITHYA AVVARU,Mounika Marreddy,Radhika Mamidi
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2019
@inproceedings{bib_Affe_2019, AUTHOR = {REDDY, OOTA SUBBA and AVVARU, ADITHYA and Marreddy, Mounika and Mamidi, Radhika}, TITLE = {Affect in Tweets Using Experts Model}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}, YEAR = {2019}}
Estimating the intensity of emotion has gained significance as modern textual inputs in potential applications like social media, e-retail markets, psychology, advertisements, etc., carry a lot of emotions, feelings, and expressions along with their meaning. However, the approaches of traditional sentiment analysis primarily focus on classifying the sentiment in general (positive or negative) or at an aspect level (very positive, low negative, etc.) and cannot exploit the intensity information. Moreover, automatically identifying emotions like anger, fear, joy, sadness, disgust, etc., from text introduces challenging scenarios where a single tweet may contain multiple emotions with different intensities and some emotions may even co-occur in some of the tweets. In this paper, we propose an architecture, Experts Model, inspired by the standard Mixture of Experts (MoE) model. The key idea here is that each expert learns different sets of features from the feature vector, which helps in better emotion detection from the tweet. We compared the results of our Experts Model with both baseline results and the top five performers of SemEval-2018 Task-1, Affect in Tweets (AIT). The experimental results show that our proposed approach deals with the emotion detection problem and ranks among the top-5 results.
Stance Detection in Code-Mixed Hindi-English Social Media Data using Multi-Task Learning
SANE SUSHMITHA REDDY,Suraj Tripathi,KOUSHIK REDDY SANE,Radhika Mamidi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Stan_2019, AUTHOR = {REDDY, SANE SUSHMITHA and Tripathi, Suraj and SANE, KOUSHIK REDDY and Mamidi, Radhika}, TITLE = {Stance Detection in Code-Mixed Hindi-English Social Media Data using Multi-Task Learning}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}, YEAR = {2019}}
Social media sites like Facebook, Twitter, and other microblogging forums have emerged as a platform for people to express their opinions and views on different issues and events. It is often observed that people tend to take a stance (in favor, against, or neutral) towards a particular topic. The task of assessing the stance taken by the individual became significantly important with the growth in the usage of online social platforms. An automatic stance detection system understands the user’s stance by analyzing standalone texts against a target entity. Due to the limited contextual information a single sentence provides, it is challenging to solve this task effectively. In this paper, we introduce a Multi-Task Learning (MTL) based deep neural network architecture for automatically detecting stance present in the code-mixed corpus. We apply our approach on a Hindi-English code-mixed corpus against the target entity “Demonetisation.” Our best model achieved a stance prediction accuracy of 63.2%, which is a 4.5% overall accuracy improvement compared to the current supervised classification systems developed using the benchmark dataset for code-mixed data stance detection.
Deep Learning Techniques for Humor Detection in Hindi-English Code-Mixed Tweets
SANE SUSHMITHA REDDY,Suraj Tripathi,KOUSHIK REDDY SANE,Radhika Mamidi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Deep_2019, AUTHOR = {REDDY, SANE SUSHMITHA and Tripathi, Suraj and SANE, KOUSHIK REDDY and Mamidi, Radhika}, TITLE = {Deep Learning Techniques for Humor Detection in Hindi-English Code-Mixed Tweets}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}, YEAR = {2019}}
We propose bilingual word embeddings based on word2vec and fastText models (CBOW and Skip-gram) to address the problem of Humor detection in Hindi-English code-mixed tweets in combination with deep learning architectures. We focus on deep learning approaches which are not widely used on code-mixed data and analyzed their performance by experimenting with three different neural network models. We propose convolution neural network (CNN) and bidirectional long-short term memory (biLSTM) (with and without Attention) models which take the generated bilingual embeddings as input. We make use of Twitter data to create bilingual word embeddings. All our proposed architectures outperform the state-of-the-art results, and Attention based bidirectional LSTM model achieved an accuracy of 73.6% which is an increment of more than 4% compared to the current state-of-the-art results.
Hindi Question Generation Using Dependency Structures
KAVERI ANURANJANA,V ANVESH RAO,Radhika Mamidi
Technical Report, arXiv, 2019
@inproceedings{bib_Hind_2019, AUTHOR = {ANURANJANA, KAVERI and RAO, V ANVESH and Mamidi, Radhika}, TITLE = {Hindi Question Generation Using Dependency Structures}, BOOKTITLE = {Technical Report}, YEAR = {2019}}
Hindi question answering systems suffer from a lack of data. To address the same, this paper presents an approach towards automatic question generation. We present a rule-based system for question generation in Hindi by formalizing question transformation methods based on karaka-dependency theory. We use a Hindi dependency parser to mark the karaka roles and use IndoWordNet, a Hindi ontology, to detect the semantic category of the karaka role heads to generate the interrogatives. We analyze how one sentence can have multiple generations from the same karaka role’s rule. The generations are manually annotated by multiple annotators on a semantic and syntactic scale for evaluation. Further, we constrain our generation with the help of various semantic and syntactic filters so as to improve the generation quality. Using these methods, we are able to generate diverse questions, significantly more than the number of sentences fed to the system.
Evaluating the Combination of Word Embeddings with Mixture of Experts and Cascading gcForest in Identifying Sentiment Polarity
Mounika Marreddy,OOTA SUBBA REDDY,RADHA AGARWAL,Radhika Mamidi
International Workshop on Issues of Sentiment Discovery and Opinion Mining, WISDOM, 2019
@inproceedings{bib_Eval_2019, AUTHOR = {Marreddy, Mounika and REDDY, OOTA SUBBA and AGARWAL, RADHA and Mamidi, Radhika}, TITLE = {Evaluating the Combination of Word Embeddings with Mixture of Experts and Cascading gcForest in Identifying Sentiment Polarity}, BOOKTITLE = {International Workshop on Issues of Sentiment Discovery and Opinion Mining}, YEAR = {2019}}
Neural word embeddings have been able to deliver impressive results in many Natural Language Processing tasks. The quality of the word embedding determines the performance of a supervised model. However, choosing the right set of word embeddings for a given dataset is a major challenge for enhancing the results. In this paper, we have evaluated neural word embeddings on the sentiment analysis task in two steps: (i) we propose a mixture of classification experts (MoCE) model for the sentiment classification task, and (ii) we compare and improve the classification accuracy by using different combinations of word embeddings as first-level features and passing them to a cascade model inspired by gcForest to extract diverse features. We argue that in the first step, each expert learns certain positive or negative examples corresponding to its category, and in the second step the resulting features on a given task (polarity identification) can achieve competitive performance with state-of-the-art methods in terms of accuracy, precision and recall using gcForest.
Detecting Political Bias in News Articles Using Headline Attention
Gangula Rama Rohit Reddy,Suma Reddy Duggenpudi,Radhika Mamidi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Dete_2019, AUTHOR = {Reddy, Gangula Rama Rohit and Duggenpudi, Suma Reddy and Mamidi, Radhika}, TITLE = {Detecting Political Bias in News Articles Using Headline Attention}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}, YEAR = {2019}}
Language is a powerful tool which can be used to state the facts as well as express our views and perceptions. Most of the time, we find a subtle bias towards or against someone or something. When it comes to politics, media houses and journalists are known to create bias by shrewd means such as misinterpreting reality and distorting viewpoints towards some parties. This misinterpretation on a large scale can lead to the production of biased news and conspiracy theories. Automating bias detection in newspaper articles could be a good challenge for research in NLP. We proposed a headline attention network for this bias detection. Our model has two distinctive characteristics: (i) it has a structure that mirrors a person’s way of reading a news article, and (ii) it has an attention mechanism applied on the article based on its headline, enabling it to attend to more critical content to predict bias. As the required datasets were not available, we created a dataset comprising 1329 news articles collected from various Telugu newspapers and marked them for bias towards a particular political party. The experiments conducted on it demonstrated that our model outperforms various baseline methods by a substantial margin.
SmokEng: Towards Fine-grained Classification of Tobacco-related Social Media Text
Kartikey Pant,Venkata Himakar Yanamandra,Alok Debnath,Radhika Mamidi
Conference on Empirical Methods in Natural Language Processing, EMNLP, 2019
@inproceedings{bib_Smok_2019, AUTHOR = {Pant, Kartikey and Yanamandra, Venkata Himakar and Debnath, Alok and Mamidi, Radhika}, TITLE = {SmokEng: Towards Fine-grained Classification of Tobacco-related Social Media Text}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}, YEAR = {2019}}
Contemporary datasets on tobacco consumption focus on one of two topics, either public health mentions and disease surveillance, or sentiment analysis on topical tobacco products and services. However, two primary considerations are not accounted for, the language of the demographic affected and a combination of the topics mentioned above in a fine-grained classification mechanism. In this paper, we create a dataset of 3144 tweets, which are selected based on the presence of colloquial slang related to smoking and analyze it based on the semantics of the tweet. Each class is created and annotated based on the content of the tweets such that further hierarchical methods can be easily applied. Further, we prove the efficacy of standard text classification methods on this dataset, by designing experiments which do both binary as well as multi-class classification. Our experiments tackle the identification of either a specific topic (such as tobacco product promotion), a general mention (cigarettes and related products) or a more fine-grained classification. This methodology paves the way for further analysis, such as understanding sentiment or style, which makes this dataset a vital contribution to both disease surveillance and tobacco use research.
Towards Computing Inferences from English News Headlines
ELIZABETH JASMI GEORGE,Radhika Mamidi
Conference of the Pacific Association for Computational Linguistics, PACLING, 2019
@inproceedings{bib_Towa_2019, AUTHOR = {GEORGE, ELIZABETH JASMI and Mamidi, Radhika}, TITLE = {Towards Computing Inferences from English News Headlines}, BOOKTITLE = {Conference of the Pacific Association for Computational Linguistics}, YEAR = {2019}}
Newspapers are a popular form of written discourse, read by many people, thanks to the novelty of the information provided by the news content in it. A headline is the most widely read part of any newspaper due to its appearance in a bigger font and sometimes in colour print. In this paper, we suggest and implement a method for computing inferences from English news headlines, excluding the information from the context in which the headlines appear. This method attempts to generate the possible assumptions a reader formulates in mind upon reading a fresh headline. The generated inferences could be useful for assessing the impact of the news headline on readers including children. The understandability of the current state of social affairs depends greatly on the assimilation of the headlines. As the inferences that are independent of the context depend mainly on the syntax of the headline, dependency trees of headlines are used in this approach, to find the syntactical structure of the headlines and to compute inferences out of them.
Samvaadhana : A Telugu Dialogue System in Hospital Domain
Suma Reddy Duggenpudi,Kusampudi Siva Subrahamanyam Varma,Radhika Mamidi
International Joint Conference on Natural Language Processing Workshop, IJCNLP-W, 2019
@inproceedings{bib_Samv_2019, AUTHOR = {Duggenpudi, Suma Reddy and Varma, Kusampudi Siva Subrahamanyam and Mamidi, Radhika}, TITLE = {Samvaadhana : A Telugu Dialogue System in Hospital Domain}, BOOKTITLE = {International Joint Conference on Natural Language Processing Workshop}, YEAR = {2019}}
In this paper, we build a dialogue system for the hospital domain in Telugu, a resource-poor Dravidian language. It handles various hospital- and doctor-related queries. The main aim of this paper is to present an approach for modelling a dialogue system in a resource-poor language by combining linguistic and domain knowledge. Focusing on the question answering aspect of the dialogue system, we identified Question Classification and Query Processing as its two most important parts. Our method combines deep learning techniques for question classification and computational rule-based analysis for query processing. Human evaluation of the system has been performed, as there is no automated evaluation tool for dialogue systems in Telugu. Our system achieves a high overall rating along with a significantly accurate context-capturing method, as shown in the results.
Anaphora Resolution in Dialogue Systems for South Asian Languages
Vinay Annam,Nikhil Koditala,Radhika Mamidi
Technical Report, arXiv, 2019
@inproceedings{bib_Anap_2019, AUTHOR = {Annam, Vinay and Koditala, Nikhil and Mamidi, Radhika}, TITLE = {Anaphora Resolution in Dialogue Systems for South Asian Languages}, BOOKTITLE = {Technical Report}, YEAR = {2019}}
Anaphora resolution is a challenging task that has interested NLP researchers for a long time. Traditional resolution techniques such as eliminative constraints and weighted preferences have been successful in many languages. However, they are ineffective in free word order languages, which include most South Asian languages. Heuristic and rule-based techniques, which are constrained to a particular context and domain, have been typical for these languages. In this paper, we venture a new strategy using neural networks for resolving anaphora in human-human dialogues. The architecture chiefly consists of three components: a shallow parser for extracting features, a feature vector generator that produces the word embeddings, and a neural network model that predicts the antecedent mention of an anaphora. The system has been trained and tested on a Telugu conversation corpus that we generated. Given the advantage of the semantic information in word embeddings and the appended actor, gender, number, person and part of plural features, the model has reached an F1-score of 86.
Unsupervised Approach for Monitoring Satire on Social Media
Parth Patekar,Aman Sinha,Radhika Mamidi
Forum for Information Retrieval Evaluation, FIRE, 2019
@inproceedings{bib_Unsu_2019, AUTHOR = {Patekar, Parth and Sinha, Aman and Mamidi, Radhika}, TITLE = {Unsupervised Approach for Monitoring Satire on Social Media}, BOOKTITLE = {Forum for Information Retrieval Evaluation}, YEAR = {2019}}
Content on social media nowadays includes a huge number of messages, in both textual and visual forms, that are satirical in nature. Satire in the form of irony, sarcasm, and ridicule of a person is depicted by memes on social media. Satire detection in text is an active area of research, but it is relatively less explored in the visual domain. The objective of our work is the detection of satire in images taken from the popular photo-sharing platform Flickr. Traditional methods for visual satire detection are based on supervised learning, which requires annotating the data, a tedious task. To address this issue, we propose a novel, unsupervised approach that leverages the visual semantics of the images. We provide a study of clustering methods in which the difference between the visual semantics of the two classes, satirical and non-satirical, becomes the basis for classifying visual content. We suggest an autoencoder-based clustering framework that effectively combines embedded feature learning and clustering assignments for the detection of satire.
Word Level Language Identification in English Telugu Code Mixed Data
Gundapu Sunil,Radhika Mamidi
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2018
@inproceedings{bib_Word_2018, AUTHOR = {Sunil, Gundapu and Mamidi, Radhika}, TITLE = {Word Level Language Identification in English Telugu Code Mixed Data}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}, YEAR = {2018}}
In multilingual or sociolingual settings, Intra-sentential Code Switching (ICS), or Code Mixing (CM), is frequently observed nowadays. Most people in the world know more than one language. CM usage is especially apparent on social media platforms. Moreover, ICS is particularly significant in the context of technology, health and law, where conveying upcoming developments is difficult in one's native language. In applications such as dialog systems, machine translation, semantic parsing and shallow parsing, CM and Code Switching pose serious challenges. Language Identification is a necessary step for any further advancement in processing code-mixed data. In this paper, we present a study of various models for Language Identification in English-Telugu code-mixed data: Naive Bayes Classifier, Random Forest Classifier, Conditional Random Field (CRF) and Hidden Markov Model (HMM). Considering the paucity of resources in code-mixed languages, we propose a CRF model and an HMM model for word-level language identification. Our best performing system is CRF-based, with an F1-score of 0.91.
Predicting the Genre and Rating of a Movie Based on its Synopsis
BATTU VARSHIT,BATCHU VENKAT VISHAL,Gangula Rama Rohit Reddy,DAKANNAGARI MOHANA MURALI KRISHNA REDDY,Radhika Mamidi
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2018
@inproceedings{bib_Pred_2018, AUTHOR = {VARSHIT, BATTU and VISHAL, BATCHU VENKAT and Reddy, Gangula Rama Rohit and REDDY, DAKANNAGARI MOHANA MURALI KRISHNA and Mamidi, Radhika}, TITLE = {Predicting the Genre and Rating of a Movie Based on its Synopsis}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}, YEAR = {2018}}
Movies are one of the most prominent means of entertainment. The widespread use of the Internet in recent times has led to large volumes of data related to movies being generated and shared online. People often prefer to express their views online in English rather than in other local languages. This leaves us with very little data to work on in languages other than English. To overcome this, we created the Multi-Language Movie Review Dataset (MLMRD). The dataset consists of the genre, rating, and synopsis of movies across multiple languages, namely Hindi, Telugu, Tamil, Malayalam, Korean, French, and Japanese. The genre of a movie can be identified from its synopsis. Although the rating of a movie may depend on multiple factors such as the performance of actors, screenplay and direction, in most cases the synopsis plays a crucial role in the movie rating. In this work, we provide various model architectures that can be used to predict the genre and the rating of a movie, based on its synopsis, across the languages present in our dataset.
Sad or Glad? Corpus Creation for Odia Poetry with Sentiment Polarity Information
GAURAV MOHANTY,PRUTHWIK MISHRA,Radhika Mamidi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2018
@inproceedings{bib_Sad__2018, AUTHOR = {MOHANTY, GAURAV and MISHRA, PRUTHWIK and Mamidi, Radhika}, TITLE = {Sad or Glad? Corpus Creation for Odia Poetry with Sentiment Polarity Information}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}, YEAR = {2018}}
Resource-poor languages, like Odia, inherently lack the resources and tools needed for sentiment analysis to give promising results. With more user-generated raw data readily available today, it is of prime importance to have annotated corpora from various domains. This paper is a first attempt towards building an annotated corpus of Odia poetry with sentiment labels. The annotated corpus is further used for sentiment classification using machine learning techniques in order to establish a baseline. Stylistic variations and structural differences between poetic and non-poetic texts make sentiment classification more challenging for the former. Using the annotated corpus of poems, we obtained comparable accuracies across various classification models. Linear-SVM outperformed the other classifiers with a macro F1-score of 0.68. The annotated corpus contains a total of 730 Odia poems of various genres with a vocabulary of more than 23k words. A Fleiss' kappa score of 0.83 was obtained, which corresponds to near-perfect agreement among the annotators.
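For readers unfamiliar with the agreement statistic quoted in this abstract, the following is a minimal sketch of how Fleiss' kappa is computed from a table of per-item category counts. The function name `fleiss_kappa` and the toy counts are illustrative assumptions; they are not taken from the paper or its corpus.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j.

    Every item must be rated by the same number of annotators (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)  # agreement expected by chance

    if p_expected == 1.0:
        # All ratings are in one category; kappa is undefined in this case.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical example: 4 poems, 3 annotators,
    # categories = (positive, negative, neutral).
    toy_counts = [
        [3, 0, 0],
        [0, 3, 0],
        [2, 1, 0],
        [0, 0, 3],
    ]
    print(f"kappa = {fleiss_kappa(toy_counts):.3f}")
```

Values above 0.8 are conventionally read as near-perfect agreement, which is the interpretation the abstract applies to its score of 0.83.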
Impact of Translation on Sentiment Analysis: A Case-Study on Telugu Reviews
Gangula Rama Rohit Reddy,Radhika Mamidi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2018
@inproceedings{bib_Impa_2018, AUTHOR = {Reddy, Gangula Rama Rohit and Mamidi, Radhika}, TITLE = {Impact of Translation on Sentiment Analysis: A Case-Study on Telugu Reviews}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}, YEAR = {2018}}
Sentiment analysis research has predominantly been on English texts. There exist many sentiment resources for English, but very few exist for other languages. To improve sentiment analysis in a low-resource language, sentiment-labeled corpora are translated from English into the focus language and used as additional resources for sentiment analysis research in the focus language [3]. But when text is translated from one language into another, sentiment is preserved to varying degrees. In this paper, we use product and book reviews in English as a stand-in for source language text and determine the loss in sentiment and sentiment predictability when they are translated into Telugu (a low-resource South Asian language), manually and automatically. For this purpose, we use manually and automatically determined sentiment labels of the English text as a benchmark. We show that sentiment analysis of Telugu manual translations of English text produces competitive results with respect to English sentiment analysis. We discover that even though machine translation significantly reduces the human ability to recover sentiment, automatic sentiment systems are still able to capture sentiment information from the translations in certain cases. In the process, we created a Telugu-English parallel corpus that is independently annotated for sentiment using a 5-value scale by Telugu and English speakers. We also created a Telugu lexicon annotated at both sentiment and emphasis level.
What is this Song About?: Identification of Keywords in Bollywood Lyrics
G DRUSHTI APOORVA,Kritik Mathur,Priyansh Agrawal,Radhika Mamidi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2018
@inproceedings{bib_What_2018, AUTHOR = {APOORVA, G DRUSHTI and Mathur, Kritik and Agrawal, Priyansh and Mamidi, Radhika}, TITLE = {What is this Song About?: Identification of Keywords in Bollywood Lyrics}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}, YEAR = {2018}}
The keywords of a document are representative of its content, and having meaningful keywords facilitates the search and organization of documents. Hence, finding methods that can automatically identify keywords in a document is very important, as manual processes for this are cumbersome and error-prone. For song lyrics, this task has varied applications such as recommendation systems and digital music library management. This work proposes and compares methods to identify keywords from lyrics of Bollywood songs. We use a collection of lyrics of 1055 Bollywood songs, all written in the Devanagari script. Experiments include looking at the spatial distribution of the terms, their occurrence in a certain context or position, and using WordNet to generate keywords not present in the document. Validation was done by human annotators, who assigned a score to each method based on the results obtained on a subset of the data. We also used Latent Dirichlet Allocation and Latent Semantic Indexing to validate the results, as further explained in the paper.
Towards better Sentence Classification for Morphologically Rich Languages
Tummalapalli Madhuri,Manoj Chinnakotla,Radhika Mamidi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2018
@inproceedings{bib_Towa_2018, AUTHOR = {Madhuri, Tummalapalli and Chinnakotla, Manoj and Mamidi, Radhika}, TITLE = {Towards better Sentence Classification for Morphologically Rich Languages}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}, YEAR = {2018}}
Many methods have been developed for various sentence classification tasks for English. These usually exploit linguistic resources like parsers or rely on large amounts of annotated or unannotated data, making it difficult to adapt them to other languages. In this paper, we present an evaluation of popular deep learning methods for sentence classification on morphologically rich Indian languages, specifically Hindi and Telugu. For this purpose, we also created a question classification dataset for Hindi by translating the TREC-UIUC dataset. We show that character-based input can enhance the performance of current classification systems for morphologically rich languages. Finally, we show that our multiInput-CNN variant performs better than our baselines in two out of three tasks in Hindi and Telugu, while giving comparable results for the others.
Sentiment as a Prior for Movie Rating Prediction
BATTU VARSHIT,BATCHU VENKAT VISHAL,DAKANNAGARI MOHANA MURALI KRISHNA REDDY,Radhika Mamidi
International Conference on Innovation in Artificial Intelligence, ICIAI, 2018
@inproceedings{bib_Sent_2018, AUTHOR = {VARSHIT, BATTU and VISHAL, BATCHU VENKAT and REDDY, DAKANNAGARI MOHANA MURALI KRISHNA and Mamidi, Radhika}, TITLE = {Sentiment as a Prior for Movie Rating Prediction}, BOOKTITLE = {International Conference on Innovation in Artificial Intelligence}, YEAR = {2018}}
Movie ratings play an important role in tasks such as user movie recommendation and verifying the relationship between user-submitted reviews and ratings. The ability to predict the rating of a movie would be useful considering these aspects. In this work, we first propose methods to predict a movie's rating based on its summary. We then set out to use priors that are generally available with movie summaries in order to improve the accuracy. To achieve this, we also consider the associated movie reviews while predicting the rating, and we provide insights on why this helps our models perform better. We use the review-based sentiment along with the summary to predict the rating more accurately, since the sentiment captures a lot of essential information that can aid rating prediction. We experiment with various deep learning architectures, and the results show a significant accuracy boost of around 2% in most of the models, which shows the generalizability of our approach.
Context and Humor: Understanding Amul advertisements of India
Radhika Mamidi
Technical Report, arXiv, 2018
@inproceedings{bib_Cont_2018, AUTHOR = {Mamidi, Radhika}, TITLE = {Context and Humor: Understanding Amul advertisements of India}, BOOKTITLE = {Technical Report}, YEAR = {2018}}
Contextual knowledge is the most important element in understanding language. By contextual knowledge we mean both general knowledge and discourse knowledge, i.e. knowledge of the situational context, background knowledge and the co-textual context [10]. In this paper, we discuss the importance of contextual knowledge in understanding the humor present in the cartoon-based Amul advertisements in India. In the process, we analyze these advertisements and also examine whether humor is an effective tool for advertising and, thereby, for marketing. These bilingual advertisements also expect the audience to have the appropriate linguistic knowledge, which includes knowledge of English and Hindi vocabulary, morphology and syntax. Different techniques like punning, portmanteaus and parodies of popular proverbs, expressions, acronyms, famous dialogues, songs, etc. are employed to convey the message in a humorous way. The present study concentrates on these linguistic cues and the context required for understanding wit and humor.
Resource Creation Towards Automated Sentiment Analysis in Telugu (a low resource language) and Integrating Multiple Domain Sources to Enhance Sentiment Prediction
Gangula Rama Rohit Reddy,Radhika Mamidi
International Conference on Language Resources and Evaluation, LREC, 2018
@inproceedings{bib_Reso_2018, AUTHOR = {Reddy, Gangula Rama Rohit and Mamidi, Radhika}, TITLE = {Resource Creation Towards Automated Sentiment Analysis in Telugu (a low resource language) and Integrating Multiple Domain Sources to Enhance Sentiment Prediction}, BOOKTITLE = {International Conference on Language Resources and Evaluation}, YEAR = {2018}}
Understanding the polarity or sentiment of a text is an important task in many application scenarios. Sentiment analysis of a text can be used to address various questions, such as election prediction or the favourability of a product. But the sentiment analysis task becomes challenging for low-resource languages, because sentiment classifiers are learned from annotated datasets, and annotated datasets for non-English texts hardly exist. For the development of sentiment classifiers in Telugu, we have therefore created the "Sentiraama" corpora for different domains, namely movie reviews, song lyrics, product reviews and book reviews, with the text written in Telugu script. In this paper, we describe the process of creating the corpora and assigning polarities to them. After creating the corpora, we trained classifiers that yield good classification results. Typically, a sentiment classifier is trained using data from the same domain it is intended to be tested on. But there may not be sufficient data available in that domain, and using additional data from multiple sources and domains may help in creating a more generalized sentiment classifier that can be applied to multiple domains. To create this generalized classifier, we used the sentiment data from the different domains of the above corpora. We first tested the performance of sentiment analysis models built using a single data source for both in-domain and cross-domain classification. Later, we built a sentiment model using data samples from multiple domains and then tested its classification performance. Finally, we compared all three approaches based on the performance of the models and discussed the best approach for sentiment analysis.
“How to rate a video game?” - A prediction system for video games based on multimodal information
BATCHU VENKAT VISHAL,BATTU VARSHIT,DAKANNAGARI MOHANA MURALI KRISHNA REDDY,Radhika Mamidi
Frontiers in Pattern Recognition and Artificial Intelligence, FIPRAI, 2018
@inproceedings{bib_“H_2018, AUTHOR = {VISHAL, BATCHU VENKAT and VARSHIT, BATTU and REDDY, DAKANNAGARI MOHANA MURALI KRISHNA and Mamidi, Radhika}, TITLE = {“How to rate a video game?” - A prediction system for video games based on multimodal information}, BOOKTITLE = {Frontiers in Pattern Recognition and Artificial Intelligence}, YEAR = {2018}}
Video games have become an integral part of most people’s lives in recent times. This has led to an abundance of data related to video games being shared online. However, this comes with issues such as incorrect ratings, reviews or other shared information. Recommendation systems are powerful tools that help users by providing them with meaningful recommendations. A straightforward approach would be to predict the scores of video games based on other information related to the game. This could be used to validate user-submitted ratings as well as to provide recommendations. This work provides a method to predict the G-Score, which indicates how good a video game is, from its trailer (video) and summary (text). We first propose models to predict the G-Score based on the trailer alone (unimodal). Later on, we show that considering information from multiple modalities helps the models perform better compared to using information from videos alone. Since we couldn’t find any suitable multimodal video game dataset, we created our own dataset named VGD (Video Game Dataset) and provide it along with this work. The approach presented here can be generalized to other multimodal datasets, such as movie trailers and summaries. Towards the end, we discuss the shortcomings of the work and some methods to overcome them.
Automatic Target Recovery for Hindi-English Code Mixed Puns
SRISHTI AGGARWAL,Kritik Mathur,Radhika Mamidi
International Joint Conference on Artificial Intelligence, IJCAI, 2018
@inproceedings{bib_Auto_2018, AUTHOR = {AGGARWAL, SRISHTI and Mathur, Kritik and Mamidi, Radhika}, TITLE = {Automatic Target Recovery for Hindi-English Code Mixed Puns}, BOOKTITLE = {International Joint Conference on Artificial Intelligence}, YEAR = {2018}}
In order for our computer systems to be more human-like, with a higher emotional quotient, they need to be able to process and understand intrinsic human language phenomena like humour. In this paper, we consider a subtype of humour - puns, which are a common type of wordplay-based jokes. In particular, we consider code-mixed puns which have become increasingly mainstream on social media, in informal conversations and advertisements and aim to build a system which can automatically identify the pun location and recover the target of such puns. We first study and classify code-mixed puns into two categories namely intra-sentential and intra-word, and then propose a four-step algorithm to recover the pun targets for puns belonging to the intra-sentential category. Our algorithm uses language models, and phonetic similarity-based features to get the desired results. We test our approach on a small set of code-mixed punning advertisements, and observe that our system is successfully able to recover the targets for 67% of the puns.
Automatic Spelling Correction for Resource-Scarce Languages using Deep Learning
ETOORI PRAVALLIKA,Manoj Chinnakotla,Radhika Mamidi
Student Research Workshop, SRW, 2018
@inproceedings{bib_Auto_2018, AUTHOR = {PRAVALLIKA, ETOORI and Chinnakotla, Manoj and Mamidi, Radhika}, TITLE = {Automatic Spelling Correction for Resource-Scarce Languages using Deep Learning}, BOOKTITLE = {Student Research Workshop}, YEAR = {2018}}
Spelling correction is a well-known task in Natural Language Processing (NLP). Automatic spelling correction is important for many NLP applications like web search engines, text summarization, sentiment analysis, etc. Most approaches use parallel data of noisy and correct word mappings from different sources as training data for automatic spelling correction. Indic languages are resource-scarce and do not have such parallel data due to the low volume of queries and the nonexistence of such prior implementations. In this paper, we show how to build an automatic spelling corrector for resource-scarce languages. We propose a sequence-to-sequence deep learning model which trains end-to-end. We perform experiments on synthetic datasets created for the Indic languages Hindi and Telugu by incorporating spelling mistakes committed at the character level. A comparative evaluation shows that our model is competitive with the existing spell checking and correction techniques for Indic languages.
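The synthetic-data idea mentioned in this abstract can be sketched roughly as follows: clean words are corrupted with random character-level edits to yield (noisy, correct) training pairs. The choice of edit operations, the function names `corrupt` and `make_pairs`, and the sample words are assumptions made for illustration; the abstract does not specify how the paper's errors were generated.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical set of character-level edit operations.
OPERATIONS = ("insert", "delete", "substitute", "transpose")


def corrupt(word: str, alphabet: Sequence[str], rng: random.Random,
            n_edits: int = 1) -> str:
    """Apply up to n_edits random character-level edits to word.

    Python strings are sequences of code points, so in Indic scripts a
    dependent vowel sign counts as its own character here. Some edits can
    be no-ops (e.g. substituting a character with itself).
    """
    chars = list(word)
    for _ in range(n_edits):
        op = rng.choice(OPERATIONS)
        if op == "insert":
            chars.insert(rng.randrange(len(chars) + 1), rng.choice(alphabet))
        elif op == "delete" and len(chars) > 1:
            del chars[rng.randrange(len(chars))]
        elif op == "substitute" and chars:
            chars[rng.randrange(len(chars))] = rng.choice(alphabet)
        elif op == "transpose" and len(chars) > 1:
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def make_pairs(words: Sequence[str], alphabet: Sequence[str],
               seed: int = 0, n_edits: int = 1) -> List[Tuple[str, str]]:
    """Return (noisy, correct) pairs, reproducible for a given seed."""
    rng = random.Random(seed)
    return [(corrupt(w, alphabet, rng, n_edits), w) for w in words]


if __name__ == "__main__":
    clean_words = ["भाषा", "पुस्तक", "భాష", "పుస్తకం"]  # illustrative words
    # Use the characters seen in the word list as the replacement alphabet.
    alphabet = sorted(set("".join(clean_words)))
    for noisy, correct in make_pairs(clean_words, alphabet, seed=42):
        print(f"{noisy}\t{correct}")
```

Pairs of this shape could serve as source and target sequences for a sequence-to-sequence model; a realistic generator would probably weight edits by how likely each confusion is in the script at hand.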
Exploring Chunk Based Templates for Generating a subset of English Text
NIKHILESH BHATNAGAR,Manish Srivastava,Radhika Mamidi
Student Research Workshop, SRW, 2018
@inproceedings{bib_Expl_2018, AUTHOR = {BHATNAGAR, NIKHILESH and Srivastava, Manish and Mamidi, Radhika}, TITLE = {Exploring Chunk Based Templates for Generating a subset of English Text}, BOOKTITLE = {Student Research Workshop}, YEAR = {2018}}
Natural Language Generation (NLG) is a research task which addresses the automatic generation of natural language text representative of an input non-linguistic collection of knowledge. In this paper, we address the task of the generation of grammatical sentences in an isolated context given a partial bag-of-words which the generated sentence must contain. We view the task as a search problem (a problem of choice) involving combinations of smaller chunk based templates extracted from a training corpus to construct a complete sentence. To achieve that, we propose a fitness function which we use in conjunction with an evolutionary algorithm as the search procedure to arrive at a potentially grammatical sentence (modeled by the fitness score) which satisfies the input constraints.
Addition of Code Mixed Features to Enhance the Sentiment Prediction of Song Lyrics
Gangula Rama Rohit Reddy,Radhika Mamidi
International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artifi, IJCAI-ECAI, 2018
@inproceedings{bib_Addi_2018, AUTHOR = {Reddy, Gangula Rama Rohit and Mamidi, Radhika}, TITLE = {Addition of Code Mixed Features to Enhance the Sentiment Prediction of Song Lyrics}, BOOKTITLE = {International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artifi}, YEAR = {2018}}
Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments, attitudes and emotions. Songs are important to sentiment analysis since songs and mood are mutually dependent. The selected song makes it easy to infer the mood of the listener, which can in future be used for recommendation. Song lyrics are a rich source of words that are helpful in analyzing and classifying the sentiments they convey. Nowadays we observe a lot of inter-sentential and intra-sentential code-mixing in songs, which has a varying impact on the audience. To study this impact, we created a Telugu songs dataset containing both Telugu-English code-mixed and pure Telugu songs. In this paper, we classify the songs as exciting or non-exciting based on their arousal. We develop a language identification tool and introduce code-mixing features obtained from it as additional features. Our system with these additional features attains 4-5% higher accuracy than traditional approaches on our dataset.
Towards Automation of Sense-type Identification of Verbs in OntoSenseNet (Telugu)
PARUPALLI SREEKAVITHA,V ANVESH RAO,Radhika Mamidi
International Workshop on Natural Language Processing for Social Media, SocialNLP-W, 2018
@inproceedings{bib_Towa_2018, AUTHOR = {SREEKAVITHA, PARUPALLI and RAO, V ANVESH and Mamidi, Radhika}, TITLE = {Towards Automation of Sense-type Identification of Verbs in OntoSenseNet (Telugu)}, BOOKTITLE = {International Workshop on Natural Language Processing for Social Media}, YEAR = {2018}}
In this paper, we discuss the enrichment of OntoSenseNet, a manually developed Telugu lexical resource. OntoSenseNet is an ontological sense-annotated lexicon that marks each verb of Telugu with a primary and a secondary sense. The area of research is relatively recent but has a large scope for development. We provide introductory work to enrich OntoSenseNet and promote further research in Telugu. Classifiers are adopted to learn the sense-relevant features of the words in the resource and also to automate the tagging of sense-types for verbs. We perform a comparative analysis of different classifiers applied to OntoSenseNet. The results of the experiment show that automated enrichment of the resource is effective using SVM classifiers and an AdaBoost ensemble.
BCSAT : A Benchmark Corpus for Sentiment Analysis in Telugu Using Word-level Annotations
PARUPALLI SREEKAVITHA,V ANVESH RAO,Radhika Mamidi
Student Research Workshop, SRW, 2018
@inproceedings{bib_BCSA_2018, AUTHOR = {SREEKAVITHA, PARUPALLI and RAO, V ANVESH and Mamidi, Radhika}, TITLE = {BCSAT : A Benchmark Corpus for Sentiment Analysis in Telugu Using Word-level Annotations}, BOOKTITLE = {Student Research Workshop}, YEAR = {2018}}
The presented work aims at generating a systematically annotated corpus that can support the enhancement of sentiment analysis tasks in Telugu using word-level sentiment annotations. From OntoSenseNet, we extracted 11,000 adjectives, 253 adverbs and 8483 verbs, and sentiment annotation is being done by language experts. We discuss the methodology followed for the polarity annotations and validate the developed resource. This work aims at developing a benchmark corpus, as an extension to SentiWordNet, and a baseline accuracy for a model where lexeme annotations are applied for sentiment predictions. The fundamental aim of this paper is to validate and study the possibility of utilizing machine learning algorithms and word-level sentiment annotations in the task of automated sentiment identification. Furthermore, accuracy is improved by annotating the bi-grams extracted from the target corpus.
Towards Enhancing Lexical Resource and Using Sense-annotations of OntoSenseNet for Sentiment Analysis
PARUPALLI SREEKAVITHA,V ANVESH RAO,Radhika Mamidi
Workshop on Semantic Deep Learning, SemDeep, 2018
@inproceedings{bib_Towa_2018, AUTHOR = {SREEKAVITHA, PARUPALLI and RAO, V ANVESH and Mamidi, Radhika}, TITLE = {Towards Enhancing Lexical Resource and Using Sense-annotations of OntoSenseNet for Sentiment Analysis}, BOOKTITLE = {Workshop on Semantic Deep Learning}, YEAR = {2018}}
This paper illustrates the interface of the tool we developed for crowdsourcing and explains the annotation procedure in detail. Our tool is named ‘Parupalli Padajaalam’, which means web of words by Parupalli. The aim of this tool is to populate the OntoSenseNet, sentiment polarity annotated Telugu resource. Recent works have shown the importance of word-level annotations for sentiment analysis. With this as a basis, we aim to analyze the importance of sense-annotations obtained from OntoSenseNet in performing the task of sentiment analysis. We explain the features extracted from OntoSenseNet (Telugu). Furthermore, we compute and explain the adverbial class distribution of verbs in OntoSenseNet. This task is known to aid in disambiguating word senses, which helps in enhancing the performance of word-sense disambiguation (WSD) tasks.
Political Discourse Analysis : A Case Study of 2014 Andhra Pradesh State Assembly Election of Interpersonal Speech Choices
Dama Sravani,V A Lalitha Kameswari,Radhika Mamidi
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2018
@inproceedings{bib_Poli_2018, AUTHOR = {Sravani, Dama and Kameswari, V A Lalitha and Mamidi, Radhika }, TITLE = {Political Discourse Analysis : A Case Study of 2014 Andhra Pradesh State Assembly Election of Interpersonal Speech Choices}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}, YEAR = {2018}}
Since the beginning of the 20th century, people have studied the correlation between language and culture. They observed how language is used in discourses to establish power relations in society. Eggins (2004) found that the link between language and the choice made by the speaker in the exchange enables us to see speakers making meaning about interpersonal: the extent of their intimacy, their level of familiarity with each other and their attitudes and judgments. In political speeches, the speaker uses language to persuade the voters, influence their perceptions and build a positive interpersonal identity. Keeping in mind this result-oriented attempt by the speakers, Van Dijk and others (1997) describe discourse as political when it has a direct functional role as a form of political action in the political process. In this paper, we look at four such speeches given by notable politicians from both winning and losing parties during the campaign for the Andhra Pradesh State Assembly elections of 2014 and closely observe the linguistic choices made at the lexical and semantic levels. By a contrastive analysis of the speeches of winning and losing parties, we can identify the linguistic features which contribute to the outcome.
Syllables for Sentence Classification in Morphologically Rich Languages
Tummalapalli Madhuri,Radhika Mamidi
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2018
@inproceedings{bib_Syll_2018, AUTHOR = {Madhuri, Tummalapalli and Mamidi, Radhika }, TITLE = {Syllables for Sentence Classification in Morphologically Rich Languages}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}, YEAR = {2018}}
Sentence Classification is one of the most fundamental tasks in NLP, where the aim is to classify a given sentence into a pre-defined set of classes. A lot of work, varying in methodology, has been done on English in the last few years. A large proportion of these works represent the input sentences as a sequence of words in their models. Only a few of them rely on character-level representation. Through this work, we introduce a new method for representing a sentence: as a sequence of syllables. As we show in this work, syllables are a better choice to represent the sub-word level information in a sentence, which is essential for morphologically rich languages. We consider the tasks of Sentiment Analysis and Question Classification in three languages showing varied morphological richness: English, Hindi and Telugu. Through extensive evaluation, we show that syllables are the best performing input type when compared to words or characters for the morphologically rich languages, Hindi and Telugu.
Multimodal Sentiment Analysis of Telugu Songs
Harika Abburi,A ESWAR SAI AKHIL,Suryakanth Gangashetty,Radhika Mamidi
International Joint Conference on Artificial Intelligence, IJCAI, 2017
@inproceedings{bib_Mult_2017, AUTHOR = {Abburi, Harika and AKHIL, A ESWAR SAI and Gangashetty, Suryakanth and Mamidi, Radhika }, TITLE = {Multimodal Sentiment Analysis of Telugu Songs}, BOOKTITLE = {International Joint Conference on Artificial Intelligence}, YEAR = {2017}}
In this paper, an approach to detect the sentiment of a song based on its multimodal nature (text and audio) is presented. The textual lyric features are extracted from the bag of words. By using these features, Doc2Vec will generate a single vector for each song. Support Vector Machine (SVM), Naive Bayes (NB) and a combination of both these classifiers are developed to classify the sentiment using the textual lyric features. Audio features are used as an add-on to the lyrical ones and include prosody features, temporal features, spectral features, tempo and chroma features. Gaussian Mixture Models (GMM), SVM and a combination of both these classifiers are developed to classify the sentiment using audio features. GMMs are known for capturing the distribution in the features and SVMs are known for discriminating the features. Hence these models are combined to improve the performance of sentiment analysis. Performance is further improved by combining the text and audio feature domains. These text and audio features are extracted at the beginning, at the ending and for the whole song. From our experimental results, it is observed that the first 30 seconds (s) of a song give better performance for detecting the sentiment of the song than the last 30 s or the whole song.
New data is indeed helping lexical simplification
ASHISH PALAKURTHI,Radhika Mamidi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2017
@inproceedings{bib_New__2017, AUTHOR = {PALAKURTHI, ASHISH and Mamidi, Radhika }, TITLE = {New data is indeed helping lexical simplification}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}, YEAR = {2017}}
We propose the use of the Newsela corpus for Complex Word Identification, a sub-problem of Lexical Simplification and conduct an empirical evaluation by comparing it with benchmark corpora previously employed for this task. Our experiments suggest that the proposed corpus is effective for Complex Word Identification, thus helping Lexical Simplification.
Automatic Generation of Jokes in Hindi
SRISHTI AGGARWAL,Radhika Mamidi
Conference of the Association of Computational Linguistics, ACL, 2017
@inproceedings{bib_Auto_2017, AUTHOR = {AGGARWAL, SRISHTI and Mamidi, Radhika }, TITLE = {Automatic Generation of Jokes in Hindi}, BOOKTITLE = {Conference of the Association of Computational Linguistics}, YEAR = {2017}}
When it comes to computational language generation systems, humour is a relatively unexplored domain, especially for Hindi (or rather, for most languages other than English). Most researchers agree that a joke consists of two main parts, the setup and the punchline, with humour being encoded in the incongruity between the two. In this paper, we look at Dur se Dekha jokes, a restricted domain of humorous three-liner poetry in Hindi. We analyze their structure to understand how humour is encoded in them and formalize it. We then develop a system which is able to generate a basic form of these jokes.
When does a compliment become sexist? analysis and classification of ambivalent sexism using twitter data
AKSHITA JHA,Radhika Mamidi
workshop on NLP and computational social science, NLPCSS-W, 2017
@inproceedings{bib_When_2017, AUTHOR = {JHA, AKSHITA and Mamidi, Radhika }, TITLE = {When does a compliment become sexist? analysis and classification of ambivalent sexism using twitter data}, BOOKTITLE = {workshop on NLP and computational social science}, YEAR = {2017}}
Sexism is prevalent in today’s society, both offline and online, and poses a credible threat to social equality with respect to gender. According to ambivalent sexism theory (Glick and Fiske, 1996), it comes in two forms: Hostile and Benevolent. While hostile sexism is characterized by an explicitly negative attitude, benevolent sexism is more subtle. Previous works on computationally detecting sexism present online are restricted to identifying the hostile form. Our objective is to investigate the less pronounced form of sexism demonstrated online. We achieve this by creating and analyzing a dataset of tweets that exhibit benevolent sexism. By using Support Vector Machines (SVM), sequence-to-sequence models and the FastText classifier, we classify tweets into the ‘Hostile’, ‘Benevolent’ or ‘Others’ class depending on the kind of sexism they exhibit. We achieve an F1-score of 87.22% using the FastText classifier. Our work helps analyze and understand the highly prevalent ambivalent sexism in social media.
Bolly: Annotation of sentiment polarity in bollywood lyrics dataset
G DRUSHTI APOORVA,Radhika Mamidi
Conference of the Pacific Association for Computational Linguistics, PACLING, 2017
@inproceedings{bib_Boll_2017, AUTHOR = {APOORVA, G DRUSHTI and Mamidi, Radhika }, TITLE = {Bolly: Annotation of sentiment polarity in bollywood lyrics dataset}, BOOKTITLE = {Conference of the Pacific Association for Computational Linguistics}, YEAR = {2017}}
This work presents a corpus of Bollywood song lyrics and its metadata, annotated with sentiment polarity. We call this BolLy. It contains lyrics of 1055 songs ranging from those composed in the year 1970 to the most recent ones. This dataset is of utmost value as all the annotation is done manually by three annotators and this makes it a very rich dataset for training purposes. In this work, we describe the creation and annotation process, content, and the possible uses of the dataset. As an experiment, we have built a basic classification system to identify the emotion polarity of the song based solely on the lyrics and this can be used as a baseline algorithm for the same. BolLy can also be used for studying code-mixing with respect to lyrics.
Tag me a label with multi-arm: Active learning for telugu sentiment analysis
SANDEEP SRICHARAN MUKKU,OOTA SUBBA REDDY,Radhika Mamidi
International Conference on Big Data Analytics and Knowledge Discovery, ICBDAKD, 2017
@inproceedings{bib_Tag__2017, AUTHOR = {MUKKU, SANDEEP SRICHARAN and REDDY, OOTA SUBBA and Mamidi, Radhika }, TITLE = {Tag me a label with multi-arm: Active learning for telugu sentiment analysis}, BOOKTITLE = {International Conference on Big Data Analytics and Knowledge Discovery}, YEAR = {2017}}
Sentiment Analysis is one of the most active research areas in natural language processing and an extensively studied problem in data mining, web mining and text mining for the English language. With the proliferation of social media these days, data is increasing rapidly in regional languages along with English. Telugu is one such regional language with abundant data available in social media, but it is hard to find a labeled training set, as human annotation is time-consuming and cost-ineffective. To address this issue, this paper investigates the practicality of active learning for Telugu sentiment analysis. We built a hybrid approach by combining different query selection strategy frameworks to obtain more accurate training data instances from limited labeled data. Using a set of classifiers like SVM, XGBoost, and Gradient Boosted Trees (GBT), we achieved promising results with minimal error rate.
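As an illustration of what a single query selection strategy does in active learning, the following is a minimal sketch of least-confidence uncertainty sampling. It is not the hybrid approach of the paper; the function names, sentence ids and class probabilities are all hypothetical, and the probabilities stand in for the output of whatever classifier is being trained.

```python
from typing import Dict, List


def least_confidence(probs: List[float]) -> float:
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)


def select_queries(pool: Dict[str, List[float]], k: int) -> List[str]:
    """Return the ids of the k unlabeled items the classifier is least sure about."""
    ranked = sorted(pool, key=lambda item_id: least_confidence(pool[item_id]), reverse=True)
    return ranked[:k]


# Hypothetical class probabilities (positive, negative, neutral) for unlabeled sentences.
pool = {
    "s1": [0.90, 0.05, 0.05],
    "s2": [0.40, 0.35, 0.25],
    "s3": [0.55, 0.40, 0.05],
    "s4": [0.34, 0.33, 0.33],
}

# The two most uncertain sentences would be sent to a human annotator next.
to_annotate = select_queries(pool, k=2)
print(to_annotate)
```

Only the selected sentences are labeled by hand and added to the training set, which is how active learning keeps annotation cost low.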
Actsa: Annotated corpus for telugu sentiment analysis
SANDEEP SRICHARAN MUKKU,Radhika Mamidi
Workshop on Building Linguistically Generalizable NLP Systems, BLGNLP-W, 2017
@inproceedings{bib_Acts_2017, AUTHOR = {MUKKU, SANDEEP SRICHARAN and Mamidi, Radhika }, TITLE = {Actsa: Annotated corpus for telugu sentiment analysis}, BOOKTITLE = {Workshop on Building Linguistically Generalizable NLP Systems}, YEAR = {2017}}
Sentiment analysis deals with the task of determining the polarity of a document or sentence and has received a lot of attention in recent years for the English language. With the rapid growth of social media these days, a lot of data is available in regional languages besides English. Telugu is one such regional language with abundant data available in social media, but it is hard to find labelled sentence data for Telugu Sentiment Analysis. In this paper, we describe an effort to build a gold-standard annotated corpus of Telugu sentences to support Telugu Sentiment Analysis. The corpus, named ACTSA (Annotated Corpus for Telugu Sentiment Analysis), has a collection of Telugu sentences taken from different sources, which were then pre-processed and manually annotated by native Telugu speakers using our annotation guidelines. In total, we have annotated 5457 sentences, which makes our corpus the largest resource currently available. The corpus and the annotation guidelines are made publicly available.
Building a SentiWordNet For Odia
GAURAV MOHANTY,ABISHEK KANNAN,Radhika Mamidi
Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA, 2017
@inproceedings{bib_Buil_2017, AUTHOR = {MOHANTY, GAURAV and KANNAN, ABISHEK and Mamidi, Radhika }, TITLE = {Building a SentiWordNet For Odia}, BOOKTITLE = {Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis}, YEAR = {2017}}
As a discipline of Natural Language Processing, Sentiment Analysis is used to extract and analyze subjective information present in natural language data. The task of Sentiment Analysis has acquired wide commercial uses including social media monitoring tasks, survey responses, review systems, etc. Languages like English have several resources which aid in the task of Sentiment Analysis. SentiWordNet and Subjectivity WordList are examples of such tools and resources. With more data being available in native vernacular, language-specific SentiWordNets have become essential. For resource-poor languages, creating such SentiWordNets is a difficult task. One solution is to use available resources in English and translate the final source lexicon to the target lexicon via machine translation. Machine translation systems for the English-Odia language pair have not yet been developed. In this paper, we discuss a method to create a SentiWordNet for Odia, which is resource-poor, by only using resources which are currently available for Indian languages. The lexicon created would serve as a tool for Sentiment Analysis related tasks specific to Odia data.
Multi-Arm Active Transfer Learning for Telugu Sentiment Analysis.
OOTA SUBBA REDDY,I Vijayasaradhi,Mounika Marreddy,SANDEEP SRICHARAN MUKKU,Radhika Mamidi
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databa, PKDD/ECML, 2017
@inproceedings{bib_Mult_2017, AUTHOR = {REDDY, OOTA SUBBA and Vijayasaradhi, I and Marreddy, Mounika and MUKKU, SANDEEP SRICHARAN and Mamidi, Radhika }, TITLE = {Multi-Arm Active Transfer Learning for Telugu Sentiment Analysis.}, BOOKTITLE = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databa}, YEAR = {2017}}
Transfer learning algorithms can be used when a sufficient amount of training data is available in the source domain and limited training data is available in the target domain. The transfer of knowledge from one domain to another requires similarity between the two domains. In many resource-poor languages, it is rare to find labeled training data in both the source and target domains. Active learning algorithms, which query more labels from an oracle, can be used effectively in training the source domain when an oracle is available in the source domain but not available in the target domain. Active learning strategies are subjective as they are designed by humans. It can be time consuming to design a strategy and it can vary from one human to another. To tackle all these problems, we design a learning algorithm that connects transfer learning and active learning with the well-known multi-armed bandit problem by querying …
Domain independent keyword identification for question answering
PRATHYUSHA JWALAPURAM,Radhika Mamidi
International Conference on Asian Language Processing, IALP, 2017
@inproceedings{bib_Doma_2017, AUTHOR = {JWALAPURAM, PRATHYUSHA and Mamidi, Radhika }, TITLE = {Domain independent keyword identification for question answering}, BOOKTITLE = {International Conference on Asian Language Processing}, YEAR = {2017}}
In this paper, we look at domain independent keyword identification for natural language queries using statistical methods. We took queries supplemented by only their dependency tags (Stanford Parser) and part-of-speech tags (Stanford POS tagger) and labeled the keywords. We then delexicalised the training data, and used the Conditional Random Fields algorithm to learn these labels. We used the queries created by [1] in the course management domain for training, and tested our model on the queries of three domains: course management, library and the GeoQueries250 dataset. We report fairly high accuracies of 90.65%, 83.19% and 97.13% respectively, making our model a truly domain independent and highly accurate keyword identifier.
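To make the delexicalisation step concrete, here is a minimal sketch that turns a tagged query into per-token features built from tags only, so that nothing domain-specific (the words themselves) reaches the sequence labeler. The feature names, the `BOS`/`EOS` padding symbols and the hand-assigned tags in the example are assumptions for illustration, not the feature set used in the paper.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, dependency label)


def delexicalise(tokens: List[Token]) -> List[Dict[str, str]]:
    """Build per-token features from tags only, dropping the word itself."""
    feats = []
    for i, (_word, pos, dep) in enumerate(tokens):
        # Neighbouring POS tags give local context without any lexical items.
        prev_pos = tokens[i - 1][1] if i > 0 else "BOS"
        next_pos = tokens[i + 1][1] if i < len(tokens) - 1 else "EOS"
        feats.append({"pos": pos, "dep": dep, "prev_pos": prev_pos, "next_pos": next_pos})
    return feats


# Hypothetical query with hand-assigned tags (a real pipeline would take them from a parser).
query = [
    ("list", "VB", "root"),
    ("courses", "NNS", "dobj"),
    ("in", "IN", "prep"),
    ("spring", "NN", "pobj"),
]
features = delexicalise(query)
for (word, _, _), f in zip(query, features):
    print(word, f)
```

Feature dictionaries of this shape are the kind of input a Conditional Random Fields toolkit typically accepts, one list of dictionaries per query.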
“nee intention enti?” towards dialog act recognition in code-mixed conversations
J DIVYA SAI,CHANDU KHYATHI RAGHAVI,PAMIDIPALLI GNANA SRI HARSHA,Radhika Mamidi
International Conference on Asian Language Processing, IALP, 2017
@inproceedings{bib_“n_2017, AUTHOR = {SAI, J DIVYA and RAGHAVI, CHANDU KHYATHI and HARSHA, PAMIDIPALLI GNANA SRI and Mamidi, Radhika }, TITLE = {“nee intention enti?” towards dialog act recognition in code-mixed conversations}, BOOKTITLE = {International Conference on Asian Language Processing}, YEAR = {2017}}
Code-Mixing (CM) is a very commonly observed mode of communication in a multilingual configuration. The trend of using this newly emerging language has its effect as a culling option, especially on platforms like social media. This becomes particularly important in the context of technology and health, where expressing the upcoming advancements is difficult in a native language. Despite this change in language dynamics, current dialog systems cannot handle a switch between languages across sentences or mixing within a sentence. Everyday conversations are conducted in this mixed language, and analyzing dialog acts in this language is essential for making interaction with personal assistants more natural. The problem is further compounded by the crossing of script barriers in code-mixing. In this paper, we take the first step towards understanding code-mixing in dialog …
Multimodal Sentiment Analysis of Telugu Songs
HARIKA ABBURI,A ESWAR SAI AKHIL,Suryakanth Gangashetty,Radhika Mamidi
International Joint Conference on Artificial Intelligence, IJCAI, 2016
@inproceedings{bib_Mult_2016, AUTHOR = {ABBURI, HARIKA and AKHIL, A ESWAR SAI and Gangashetty, Suryakanth and Mamidi, Radhika }, TITLE = {Multimodal Sentiment Analysis of Telugu Songs}, BOOKTITLE = {International Joint Conference on Artificial Intelligence}, YEAR = {2016}}
In this paper, an approach to detect the sentiment of a song based on its multimodal nature (text and audio) is presented. The textual lyric features are extracted from the bag of words. By using these features, Doc2Vec will generate a single vector for each song. Support Vector Machine (SVM), Naive Bayes (NB) and a combination of both these classifiers are developed to classify the sentiment using the textual lyric features. Audio features are used as an add-on to the lyrical ones and include prosody features, temporal features, spectral features, tempo and chroma features. Gaussian Mixture Models (GMM), SVM and a combination of both these classifiers are developed to classify the sentiment using audio features. GMMs are known for capturing the distribution in the features and SVMs are known for discriminating the features. Hence these models are combined to improve the performance of sentiment analysis. Performance is further improved by combining the text and audio feature domains. These text and audio features are extracted at the beginning, at the ending and for the whole song. From our experimental results, it is observed that the first 30 seconds (s) of a song give better performance for detecting the sentiment of the song than the last 30 s or the whole song.
Resource Creation for Hindi-English Code Mixed Social Media Text
SAKSHI GUPTA,PIYUSH BANSAL,Radhika Mamidi
International Joint Conference on Artificial Intelligence, IJCAI, 2016
@inproceedings{bib_Reso_2016, AUTHOR = {GUPTA, SAKSHI and BANSAL, PIYUSH and Mamidi, Radhika }, TITLE = {Resource Creation for Hindi-English Code Mixed Social Media Text}, BOOKTITLE = {International Joint Conference on Artificial Intelligence}, YEAR = {2016}}
Code-mixing is a linguistic phenomenon frequently observed in user-generated content on social media, especially by multilingual users. Apart from the inherent linguistic complexity, the analysis of code-mixed content poses complex challenges owing to the presence of spelling variations, transliteration and non-adherence to a formal grammar. However, for any downstream Natural Language Processing task, tools that are able to process and analyze code-mixed data are required. Currently there is a lack of publicly available resources for code-mixed Hindi-English data, while the amount of such text is increasing every day. In this study, our focus is on the creation of a dataset that has code-mixed Hindi-English sentences along with the associated language and normalisation labels. To the best of our knowledge, our work is the first attempt at the creation of a linguistic resource for this language pair, which is also made public. In this work, we also present an empirical study detailing the construction of a language identification and normalisation system designed for this language pair.
A Karaka Dependency based Dialog Act Tagging for Telugu using combination of LMs and HMM
Dowlagar Suman,Radhika Mamidi
International Conference on Computational Linguistics and Intelligent Text Processing, ICITPCL, 2016
@inproceedings{bib_A_Ka_2016, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika }, TITLE = {A Karaka Dependency based Dialog Act Tagging for Telugu using combination of LMs and HMM}, BOOKTITLE = {International Conference on Computational Linguistics and Intelligent Text Processing}, YEAR = {2016}}
The main goal of this paper is to perform dialog act (DA) tagging for a Telugu corpus. Annotation of utterances with dialog acts is necessary to recognize the intent of the speaker in dialog systems. While the English language follows a strict subject–verb–object (SVO) syntax, Telugu is a free word order language. The n-gram DA tagging methods proposed for English will not work for free word order languages like Telugu. In this paper, we propose a method to perform DA tagging for a Telugu corpus using advanced machine learning techniques combined with karaka dependency relation modifiers. In other words, we use syntactic features obtained from karaka dependencies and apply a combination of language models (LMs) at the utterance level with a Hidden Markov Model (HMM) at the context level for DA tagging. The use of karaka dependencies for free word order languages like Telugu helps in extracting the …
Part-of-Speech Tagging for Code mixed English-Telugu Social media data
NELAKUDITI KOVIDA,J DIVYA SAI,Radhika Mamidi
International Conference on Computational Linguistics and Intelligent Text Processing, ICITPCL, 2016
@inproceedings{bib_Part_2016, AUTHOR = {KOVIDA, NELAKUDITI and SAI, J DIVYA and Mamidi, Radhika }, TITLE = {Part-of-Speech Tagging for Code mixed English-Telugu Social media data}, BOOKTITLE = {International Conference on Computational Linguistics and Intelligent Text Processing}, YEAR = {2016}}
Part-of-speech (POS) tagging is a primary and important step for many Natural Language Processing applications. POS taggers have reported high accuracies on grammatically correct monolingual data. This paper reports work on annotating code-mixed English-Telugu data collected from the social media site Facebook and creating automatic POS taggers for this corpus. POS tagging is treated as a classification problem, and we use different classifiers such as Linear SVMs, CRFs and Multinomial Bayes with different combinations of features that capture both the context of a word and its internal structure. We also report our experiments on combining monolingual POS taggers for POS tagging of this code-mixed English-Telugu data.
Shallow parsing pipeline for hindi-english code-mixed social media text
ARNAV SHARMA,SAKSHI GUPTA,RAVEESH MOTLANI,PIYUSH BANSAL,Manish Srivastava,Radhika Mamidi,Dipti Mishra Sharma
Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 2016
@inproceedings{bib_Shal_2016, AUTHOR = {SHARMA, ARNAV and GUPTA, SAKSHI and MOTLANI, RAVEESH and BANSAL, PIYUSH and Srivastava, Manish and Mamidi, Radhika and Sharma, Dipti Mishra }, TITLE = {Shallow parsing pipeline for hindi-english code-mixed social media text}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics}, YEAR = {2016}}
In this study, the problem of shallow parsing of Hindi-English code-mixed social media text (CSMT) has been addressed. We have annotated the data and developed a language identifier, a normalizer, a part-of-speech tagger and a shallow parser. To the best of our knowledge, we are the first to attempt shallow parsing on CSMT. The pipeline developed has been made available to the research community with the goal of enabling better text analysis of Hindi-English CSMT. The pipeline is accessible at this http URL.
Experiments in linear template combination using genetic algorithms
NIKHILESH BHATNAGAR,Radhika Mamidi
Technical Report, arXiv, 2016
@inproceedings{bib_Expe_2016, AUTHOR = {BHATNAGAR, NIKHILESH and Mamidi, Radhika }, TITLE = {Experiments in linear template combination using genetic algorithms}, BOOKTITLE = {Technical Report}, YEAR = {2016}}
Natural Language Generation systems typically have two parts: strategic ('what to say') and tactical ('how to say'). We present our experiments in building an unsupervised, corpus-driven, template-based tactical NLG system. We consider templates as sequences of words containing gaps. Our idea is based on the observation that templates are grammatical locally (within their textual span). We posit the construction of a sentence as a highly restricted sequence of such templates. This work is an attempt to explore the resulting search space using Genetic Algorithms to arrive at acceptable solutions. We present a baseline implementation of this approach which outputs gapped text.
IIIT at SemEval-2016 Task 11: Complex Word Identification using Nearest Centroid Classification
ASHISH PALAKURTHI,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2016
@inproceedings{bib_Iiit_2016, AUTHOR = {PALAKURTHI, ASHISH and Mamidi, Radhika }, TITLE = {IIIT at SemEval-2016 Task 11: Complex Word Identification using Nearest Centroid Classification}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2016}}
Proceedings of SemEval-2016, pages 1017–1021, San Diego, California, June 16-17, 2016. © 2016 Association for Computational Linguistics.
Enhanced sentiment classification of Telugu text using ML techniques
SANDEEP SRICHARAN MUKKU,NURENDRA CHOUDHARY,Radhika Mamidi
International Joint Conference on Artificial Intelligence, IJCAI, 2016
@inproceedings{bib_Enha_2016, AUTHOR = {MUKKU, SANDEEP SRICHARAN and CHOUDHARY, NURENDRA and Mamidi, Radhika }, TITLE = {Enhanced sentiment classification of Telugu text using ML techniques}, BOOKTITLE = {International Joint Conference on Artificial Intelligence}, YEAR = {2016}}
With the growing amount of information and availability of opinion-rich resources, it is sometimes difficult for a common man to analyse what others think. Analysing this information to see what people in general think or feel about a product or a service is the problem of Sentiment Analysis. Sentiment analysis, or sentiment polarity labelling, is an emerging field, so it needs to be accurate. In this paper, we explore various Machine Learning techniques for the classification of Telugu sentences into positive or negative polarities.
Towards building a sentiwordnet for tamil
ABISHEK KANNAN,GAURAV MOHANTY,Radhika Mamidi
International Conference on Natural Language Processing, ICON, 2016
@inproceedings{bib_Towa_2016, AUTHOR = {KANNAN, ABISHEK and MOHANTY, GAURAV and Mamidi, Radhika }, TITLE = {Towards building a sentiwordnet for tamil}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2016}}
Sentiment analysis is a discipline of Natural Language Processing which deals with analysing the subjectivity of the data. It is an important task with both commercial and academic functionality. Languages like English have several resources which assist in the task of sentiment analysis. SentiWordNet for English is one such important lexical resource that contains subjective polarity for each lexical item. With growing data in native vernacular, there is a need for language-specific SentiWordNets. In this paper, we discuss a generic approach followed for the development of a Tamil SentiWordNet using currently available resources in English. For the Tamil SentiWordNet, a Fleiss' kappa score of 0.663, indicating substantial agreement, was obtained after verification by Tamil annotators. Such a resource would serve as a baseline for future improvements in the task of sentiment analysis specific to Tamil data.
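For readers unfamiliar with the agreement measure reported above, the following is a minimal sketch of how Fleiss' kappa is computed from a table of annotator counts. The rating table is invented toy data, not the Tamil annotations, and the function name is an assumption. On the commonly used Landis and Koch scale, values between 0.61 and 0.80 are read as substantial agreement.

```python
from typing import List


def fleiss_kappa(ratings: List[List[int]]) -> float:
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    annotators who assigned item i to category j. Every row must sum
    to the same number of annotators."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])

    # Share of all assignments that went to each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters) for j in range(n_cats)]

    # Observed agreement per item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1)) for row in ratings]
    p_observed = sum(p_item) / n_items

    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)

    return (p_observed - p_expected) / (1 - p_expected)


# Toy data: 4 words, 3 annotators, categories (positive, negative, neutral).
toy_ratings = [
    [3, 0, 0],
    [0, 3, 0],
    [2, 1, 0],
    [0, 0, 3],
]
kappa = fleiss_kappa(toy_ratings)
print(round(kappa, 3))
```

The sketch assumes at least two annotators per item and that the annotators do not all choose a single category for every item, since either case makes a denominator zero.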
Subtopic Boundary Identification in Hindi Dialogue
DARSHAN AGARWAL,Vandan Mujadia,Radhika Mamidi
International Conference on Asian Language Processing, IALP, 2015
@inproceedings{bib_Subt_2015, AUTHOR = {AGARWAL, DARSHAN and Mujadia, Vandan and Mamidi, Radhika }, TITLE = {Subtopic Boundary Identification in Hindi Dialogue}, BOOKTITLE = {International Conference on Asian Language Processing}, YEAR = {2015}}
This paper describes techniques for the automatic detection of subtopic/subplan boundaries in Hindi dialogue using the structure of dialogue, dialogue acts, shallow linguistic features and word co-occurrence. Our experiments illustrate that the use of dialogue structure, word co-occurrence and WordNet improves boundary identification for Hindi natural dialogues.
Paninian Grammar Based Hindi Dialogue Anaphora Resolution
Vandan Mujadia,DARSHAN AGARWAL,Radhika Mamidi,Dipti Mishra Sharma
International Conference on Asian Language Processing, IALP, 2015
@inproceedings{bib_Pani_2015, AUTHOR = {Mujadia, Vandan and AGARWAL, DARSHAN and Mamidi, Radhika and Sharma, Dipti Mishra }, TITLE = {Paninian Grammar Based Hindi Dialogue Anaphora Resolution}, BOOKTITLE = {International Conference on Asian Language Processing}, YEAR = {2015}}
In this paper, we present a Paninian grammar based heuristic model to resolve entity-pronoun references in Hindi dialogue. We explore the use of Paninian based dependency structures as a source of syntactico-semantic information. Our experiments illustrate that the use of dependency and dialogue structures helps to resolve specific types of references. We also show that named entity features, discourse information such as subtopic boundaries, and animacy features increase the overall resolution accuracy to 64% for user-user interaction data and 59% for play-story corpora.
Classification of Attributes in a Natural Language Query into Different SQL Clauses
ASHISH PALAKURTHI,RUTHU S M,Arjun Akula,Radhika Mamidi
Recent Advances in Natural Language Processing, RANLP, 2015
@inproceedings{bib_Clas_2015, AUTHOR = {PALAKURTHI, ASHISH and M, RUTHU S and Akula, Arjun and Mamidi, Radhika }, TITLE = {Classification of Attributes in a Natural Language Query into Different SQL Clauses}, BOOKTITLE = {Recent Advances in Natural Language Processing}, YEAR = {2015}}
Attribute information in a natural language query is one of the key features for converting a natural language query into a Structured Query Language (SQL) query in Natural Language Interface to Database systems. In this paper, we explore the task of classifying the attributes present in a natural language query into the different clauses of an SQL query. In particular, we investigate the effectiveness of various features and Conditional Random Fields for this task. Our system uses a statistical classifier trained on manually prepared data. We report our results on three different domains and also show how our system can be used for generating a complete SQL query.
A Semi Supervised Dialog Act Tagging for Telugu
Dowlagar suman,Radhika Mamidi
International Conference on Natural Language Processing, ICON, 2015
@inproceedings{bib_A_Se_2015, AUTHOR = {Suman, Dowlagar and Mamidi, Radhika }, TITLE = {A Semi Supervised Dialog Act Tagging for Telugu}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2015}}
In a task-oriented domain, recognizing the intention of a speaker is important so that the conversation can proceed in the correct direction. This is possible only if there is a way of labeling the utterance with its proper intent. One such labeling technique is Dialog Act (DA) tagging. This work discusses various n-gram DA tagging techniques. In this paper, a new method is proposed for DA tagging in Telugu using n-gram karakas with back-off as the n-gram language modeling technique at the n-gram level and Memory Based Learning at the utterance level. The results show that the proposed method is on par with manual DA tagging.
Statistical Sandhi Splitter and its Effect on NLP Applications
K PRATHYUSHA,NELAKUDITI KOVIDA,Radhika Mamidi
Recent Advances in Natural Language Processing, RANLP, 2015
@inproceedings{bib_Stat_2015, AUTHOR = {PRATHYUSHA, K and KOVIDA, NELAKUDITI and Mamidi, Radhika }, TITLE = {Statistical Sandhi Splitter and its Effect on NLP Applications}, BOOKTITLE = {Recent Advances in Natural Language Processing}, YEAR = {2015}}
This paper revisits the work of Kuncham et al. (2015), which developed a statistical sandhi splitter (SSS) for agglutinative languages that was tested on Telugu and Malayalam. Handling compound words is a major challenge for Natural Language Processing (NLP) applications for agglutinative languages. Hence, in this paper we concentrate on testing the effect of SSS on NLP applications such as Machine Translation, Dialogue Systems and Anaphora Resolution, and show that the accuracy of these applications is consistently improved by using SSS. We also discuss in detail the performance of SSS on these applications.
Resolution of Pronominal Anaphora for Telugu Dialogues
JONNALAGADDA HEMANTH REDDY,Radhika Mamidi
International Conference on Natural Language Processing, ICON, 2015
@inproceedings{bib_Reso_2015, AUTHOR = {REDDY, JONNALAGADDA HEMANTH and Mamidi, Radhika }, TITLE = {Resolution of Pronominal Anaphora for Telugu Dialogues}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2015}}
The challenge of anaphora resolution has been studied for a long time. However, most of the work did not cover dialogues. In this paper we discuss the types of pronouns and anaphora in the Telugu language and make an attempt to build a rule-based pronominal anaphora resolution algorithm for human-to-human conversations. The model mainly consists of two parts: creating a knowledge base with a set of pronouns along with their morphological information, and designing an algorithm which uses this knowledge base to give an output. In this process we have worked on normal pronominal anaphora and suggested a set of rules applicable to Telugu dialogues. Since there was no corpus for the Telugu language, we built a corpus and tested the algorithm on it. Results show that the suggested algorithm produced an output with an accuracy of 61.1%.
Statistical sandhi splitter for agglutinative languages
K PRATHYUSHA,NELAKUDITI KOVIDA,SNEHA NALLANI,Radhika Mamidi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2015
@inproceedings{bib_Stat_2015, AUTHOR = {PRATHYUSHA, K and KOVIDA, NELAKUDITI and NALLANI, SNEHA and Mamidi, Radhika }, TITLE = {Statistical sandhi splitter for agglutinative languages}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}, YEAR = {2015}}
Sandhi splitting is a primary and important step for any natural language processing (NLP) application for languages which have agglutinative morphology. This paper presents a statistical approach to build a sandhi splitter for agglutinative languages. The input to the model is a valid string in the language and the output is a split of that string into one or more meaningful words. The approach adopted comprises two stages, namely Segmentation and Word generation, both of which use conditional random fields (CRFs). Our approach is robust and language independent. The results for two Dravidian languages, viz. Telugu and Malayalam, show an accuracy of 89.07% and 90.50% respectively.
Handling Multi-Sentence Queries in a Domain Independent Dialogue System
PRATHYUSHA JWALAPURAM,Radhika Mamidi
International Conference on Natural Language Processing, ICON, 2015
@inproceedings{bib_Hand_2015, AUTHOR = {JWALAPURAM, PRATHYUSHA and Mamidi, Radhika }, TITLE = {Handling Multi-Sentence Queries in a Domain Independent Dialogue System}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2015}}
This paper discusses the handling of multi-sentence queries in a mixed-initiative dialogue system based on a hierarchically structured knowledge base, in a way that is domain independent. The system is rule-based and uses dependency relations and part-of-speech tags obtained from the Stanford Parser, coupled with the hierarchical structure of the knowledge base, to identify the user’s goal. The system was tested for its accuracy in answering questions, and subjective testing was also done to evaluate the dialogue flow, primarily over the books domain. We show examples of the system developed over the domains of books, movies and restaurants to demonstrate the domain independence.
Learning phrase-level vocabulary in second language using pictures/gestures and voice
LAVANYA,DANDA PRATHYUSHA,Radhika Mamidi
International Conference on Natural Language Processing, ICON, 2014
@inproceedings{bib_Lear_2014, AUTHOR = {LAVANYA and PRATHYUSHA, DANDA and Mamidi, Radhika }, TITLE = {Learning phrase-level vocabulary in second language using pictures/gestures and voice}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2014}}
Earlier work has shown that a foreign language can be learnt better from pictures than by merely using native-language word translations. These efforts on language learning using pictures were limited to colors, animals, birds, vehicles and common noun categories. In this paper, we present our research showing that learning of verbs is also possible by using the right set of pictures and gestures, and we show that this is effective in second language learning. We report the acquisition of words in the tourist domain in a second language by working with subjects who are between 14 and 48 years of age. From pre-learning and post-learning evaluations, we show that acquisition of vocabulary such as nouns and verbs in a new language is better with the fusion of pictures and voice. We also show that the subjects are able to generalize their learning towards phrase-level vocabulary without any additional training or effort.
Statistical morph analyzer (sma++) for indian languages
CHANDIBHAMAR RAVI HASMUKHBHAI,SRIRAMPUR SAI KRISHNA,Radhika Mamidi
Applying NLP Tools to Similar Languages, Varieties and Dialects, VarDial, 2014
@inproceedings{bib_Stat_2014, AUTHOR = {HASMUKHBHAI, CHANDIBHAMAR RAVI and KRISHNA, SRIRAMPUR SAI and Mamidi, Radhika }, TITLE = {Statistical morph analyzer (sma++) for indian languages}, BOOKTITLE = {Applying NLP Tools to Similar Languages, Varieties and Dialects}, YEAR = {2014}}
Statistical morph analyzers have proved to be highly accurate while being comparatively easier to maintain than rule-based approaches. Our morph analyzer (SMA++) is an improvement over the statistical morph analyzer (SMA) described in Malladi and Mannem (2013). SMA++ predicts the gender, number, person, case (GNPC) and the lemma (L) of a given token. We modified the SMA in Malladi and Mannem (2013) by adding some rich machine learning features. The feature set was chosen specifically to suit the characteristics of Indian languages. In this paper we apply SMA++ to four Indian languages, viz. Hindi, Urdu, Telugu and Tamil. Hindi and Urdu belong to the Indic language family. Telugu and Tamil belong to the Dravidian language family. We compare SMA++ with some state-of-the-art statistical morph analyzers, viz. Morfette in Chrupała et al. (2008) and SMA in Malladi and Mannem (2013). In all four languages, our system performs better than the above-mentioned state-of-the-art SMAs.
Concepts Identification of an NL query in NLIDB Systems
SRIRAMPUR SAI KRISHNA,CHANDIBHAMAR RAVI HASMUKHBHAI,ASHISH PALAKURTHI,Radhika Mamidi
International Conference on Asian Language Processing, IALP, 2014
@inproceedings{bib_Conc_2014, AUTHOR = {KRISHNA, SRIRAMPUR SAI and HASMUKHBHAI, CHANDIBHAMAR RAVI and PALAKURTHI, ASHISH and Mamidi, Radhika }, TITLE = {Concepts Identification of an NL query in NLIDB Systems}, BOOKTITLE = {International Conference on Asian Language Processing}, YEAR = {2014}}
This paper proposes a novel approach to capture the concepts of an NL query. Given an NL query, the query is mapped to a tagset, which carries the concept information. The tagset was created by mapping every noun chunk to the attribute of a table (tableName.attributeName) and every verb chunk to a relation in the ER schema. The approach is discussed using the Courses Management domain of a university and can be extended to other domains. The tagset here was formed using the ER schema of the Courses Management Portal of our university. We used a statistical approach to identify the concepts. We built a tagged corpus ourselves with different types of NL queries. The Conditional Random Field algorithm was used for the classification. The results are very promising and are compared to the rule-based approach in Gupta et al. (2012) [1].
Identification of Karaka relations in an English sentence
GORTHI SAI KIRAN,ASHISH PALAKURTHI,Radhika Mamidi,Dipti Mishra Sharma
International Conference on Natural Language Processing, ICON, 2014
@inproceedings{bib_Iden_2014, AUTHOR = {KIRAN, GORTHI SAI and PALAKURTHI, ASHISH and Mamidi, Radhika and Sharma, Dipti Mishra }, TITLE = {Identification of Karaka relations in an English sentence}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2014}}
In this paper we explain the identification of karaka relations in an English sentence. We explain the genesis of the problem and present two different approaches, rule-based and statistical. We briefly describe the rule-based approach and focus more on the statistical approach. We process a sentence through various stages and extract features at each stage. We train Support Vector Machines (SVM) on our data to identify karaka relations. We also explain the impact of our work on Natural Language Interfaces for Database systems.
Stance classification in online debates by recognizing users’ intentions
RANADE SARVESH AJIT,Rajeev Sangal,Radhika Mamidi
Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL, 2013
@inproceedings{bib_Stan_2013, AUTHOR = {AJIT, RANADE SARVESH and Sangal, Rajeev and Mamidi, Radhika }, TITLE = {Stance classification in online debates by recognizing users’ intentions}, BOOKTITLE = {Annual Meeting of the Special Interest Group on Discourse and Dialogue}, YEAR = {2013}}
Online debate forums provide a rich collection of differing opinions on various topics. In dual-sided debates, users present their opinions or judge others’ opinions to support their stance. In this paper, we examine the use of users’ intentions and debate structure for stance classification of debate posts. We propose a domain-independent approach that captures users’ intent at the sentence level using the sentence's dependency parse and SentiWordNet, and builds the intention structure of the post to identify its stance. To aid the task of classification, we define the health of the debate structure and show that maximizing its value leads to better stance classification accuracies.
Online Debate Summarization using Topic Directed Sentiment Analysis
RANADE SARVESH AJIT,JAYANT GUPTA,Vasudeva Varma Kalidindi,Radhika Mamidi
International Workshop on Issues of Sentiment Discovery and Opinion Mining, WISDOM, 2013
@inproceedings{bib_Onli_2013, AUTHOR = {AJIT, RANADE SARVESH and GUPTA, JAYANT and Kalidindi, Vasudeva Varma and Mamidi, Radhika }, TITLE = {Online Debate Summarization using Topic Directed Sentiment Analysis}, BOOKTITLE = {International Workshop on Issues of Sentiment Discovery and Opinion Mining}, YEAR = {2013}}
Social networking sites provide users a virtual community interaction platform to share their thoughts, life experiences and opinions. An online debate forum is one such platform where people can take a stance and argue in support of or in opposition to debate topics. An important feature of such forums is that they are dynamic and grow rapidly. In such situations, effective opinion summarization approaches are needed so that readers need not go through the entire debate. This paper aims to summarize online debates by extracting highly topic-relevant and sentiment-rich sentences. The proposed approach takes into account topic-relevant, document-relevant and sentiment-based features to capture topic-opinionated sentences. ROUGE scores are used to evaluate our system. Our system significantly outperforms several baseline systems and shows 5.2% (ROUGE-1), 7.3% (ROUGE-2) and 5.5% (ROUGE-L) improvement over the state-of-the-art opinion summarization system. The results verify that topic-directed sentiment features are most important for generating effective debate summaries.
A novel approach towards incorporating context processing capabilities in nlidb system
AKULA ARJUN REDDY,Rajeev Sangal,Radhika Mamidi
International Joint Conference on Natural Language Processing, IJCNLP, 2013
@inproceedings{bib_A_no_2013, AUTHOR = {REDDY, AKULA ARJUN and Sangal, Rajeev and Mamidi, Radhika }, TITLE = {A novel approach towards incorporating context processing capabilities in nlidb system}, BOOKTITLE = {International Joint Conference on Natural Language Processing}, YEAR = {2013}}
This paper presents a novel approach to categorize, model and identify contextual information in natural language interface to database (NLIDB) systems. The interactions between user and system are categorized and modeled based on the way in which the contextual information is utilized in the interactions. A relationship schema among the responses (user and system responses) is proposed. We present a novel method to identify contextual information in one specific type of user-system interaction. We report the results of experiments with university-related queries.
A template matching approach for detecting pronunciation mismatch
LAVANYA,Radhika Mamidi,Kishore Sunkeshwari Prahallad
International Conference Computational Linguistics Workshops, COLING - W, 2012
@inproceedings{bib_A_te_2012, AUTHOR = {LAVANYA and Mamidi, Radhika and Prahallad, Kishore Sunkeshwari }, TITLE = {A template matching approach for detecting pronunciation mismatch}, BOOKTITLE = {International Conference Computational Linguistics Workshops}, YEAR = {2012}}
In this paper, we study the usefulness of the best path and the complete trellis in a dynamic programming based template matching approach for detecting pronunciation mismatch. We show that there exist cues in the trellis (a matrix representing all paths) which could be exploited for detecting pronunciation mismatch. Such an approach could be used to build a template-based, language-independent method for detecting pronunciation mismatch.
STATISTICAL SANDHI SPLITTER FOR AGGLUTINATIVE LANGUAGES
K PRATHYUSHA,NELAKUDITI KOVIDA,SNEHA NALLANI,Radhika Mamidi
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2011
@inproceedings{bib_STAT_2011, AUTHOR = {PRATHYUSHA, K and KOVIDA, NELAKUDITI and NALLANI, SNEHA and Mamidi, Radhika }, TITLE = {STATISTICAL SANDHI SPLITTER FOR AGGLUTINATIVE LANGUAGES}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}, YEAR = {2011}}
Sandhi splitting is a primary and important step for any natural language processing (NLP) application for languages which have agglutinative morphology. This paper presents a statistical approach to build a sandhi splitter for agglutinative languages. The input to the model is a valid string in the language and the output is a split of that string into one or more meaningful words. The approach adopted comprises two stages, namely Segmentation and Word generation, both of which use conditional random fields (CRFs). Our approach is robust and language independent. The results for two Dravidian languages, viz. Telugu and Malayalam, show an accuracy of 89.07% and 90.50% respectively.