@inproceedings{bib_TabX_2025, AUTHOR = {Vihang Pancholi, Bafna Jainit Sushil, Tejas Anvekar, Manish Shrivastava, Vivek Gupta}, TITLE = {TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation}, BOOKTITLE = {Association for Computational Linguistics - Findings}, YEAR = {2025}}
Evaluating tables qualitatively and quantitatively poses a significant challenge, as standard metrics often overlook subtle structural and content-level discrepancies. To address this, we propose a rubric-based evaluation framework that integrates multi-level structural descriptors with fine-grained contextual signals, enabling more precise and consistent table comparison. Building on this, we introduce TabXEval, an eXhaustive and eXplainable two-phase evaluation framework. TabXEval first aligns reference and predicted tables structurally via TabAlign, then performs semantic and syntactic comparison using TabCompare, offering interpretable and granular feedback. We evaluate TabXEval on TabXBench, a diverse, multi-domain benchmark featuring realistic table perturbations and human annotations. A sensitivity-specificity analysis further demonstrates the robustness and explainability of TabXEval across varied table tasks. Code and data are available at https://coral-lab-asu.github.io/tabxeval/.
BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages
Shamsuddeen Hassan Muhammad,Nedjma Ousidhoum,Idris Abdulmumin,Jan Philip Wahle,Terry Ruas,Meriem Beloucif,Christine De Kock,NIRMAL SURANGE,Manish Shrivastava
@inproceedings{bib_BRIG_2025, AUTHOR = {Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine De Kock, NIRMAL SURANGE, Manish Shrivastava}, TITLE = {BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages}, BOOKTITLE = {Association for Computational Linguistics}, YEAR = {2025}}
People worldwide use language in subtle and complex ways to express emotions. Although emotion recognition--an umbrella term for several NLP tasks--impacts various applications within NLP and beyond, most work in this area has focused on high-resource languages. This has led to significant disparities in research efforts and proposed solutions, particularly for under-resourced languages, which often lack high-quality annotated datasets. In this paper, we present BRIGHTER--a collection of multilabeled, emotion-annotated datasets in 28 different languages and across several domains. BRIGHTER primarily covers low-resource languages from Africa, Asia, Eastern Europe, and Latin America, with instances labeled by fluent speakers. We highlight the challenges related to the data collection and annotation processes, and then report experimental results for monolingual and crosslingual multi-label emotion identification, as well as emotion intensity recognition. We analyse the variability in performance across languages and text domains, both with and without the use of LLMs, and show that the BRIGHTER datasets represent a meaningful step towards addressing the gap in text-based emotion recognition.
Lost in Translation? Found in Evaluation: A Comprehensive Survey on Sentence-Level Translation Evaluation
@inproceedings{bib_Lost_2025, AUTHOR = {Ananya Mukherjee, Manish Shrivastava}, TITLE = {Lost in Translation? Found in Evaluation: A Comprehensive Survey on Sentence-Level Translation Evaluation}, BOOKTITLE = {ACM Computing Surveys}, YEAR = {2025}}
Machine Translation (MT) revolutionizes cross-lingual communication but is prone to errors, necessitating thorough evaluation for enhancement. Translation quality can be assessed by humans and by automatic evaluation metrics. Human evaluation, though valuable, is costly and subject to limitations in scalability and consistency. Automated metrics supplement manual evaluations, but this field still has considerable potential for development. Although prior survey work on automatic evaluation metrics exists, most of it focuses on resource-rich languages, leaving a significant gap in evaluating MT outputs across other language families.
To bridge this gap, we present an exhaustive survey, encompassing discussions on MT meta-evaluation datasets, human assessments, and diverse metrics. We categorize both human and automatic evaluation approaches, and offer decision trees to aid in selecting the appropriate approach. Additionally, we evaluate sentences across languages, domains and linguistic features, and further meta-evaluate the metrics by correlating them with human scores.
We critically examine the limitations and challenges inherent in current datasets and evaluation approaches. We propose suggestions for future research aimed at enhancing MT evaluation, including the importance of diverse and well-distributed datasets, the refinement of human evaluation methodologies, and the development of robust metrics that closely align with human judgments.
From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences
Kodali Prashant,Anmol Goel,Likhith Asapu,Vamshi Krishna Bonagiri,Anirudh Govil,Monojit Choudhury,Ponnurangam Kumaraguru,Manish Shrivastava
@inproceedings{bib_From_2025, AUTHOR = {Kodali Prashant, Anmol Goel, Likhith Asapu, Vamshi Krishna Bonagiri, Anirudh Govil, Monojit Choudhury, Ponnurangam Kumaraguru, Manish Shrivastava}, TITLE = {From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences}, BOOKTITLE = {ACM Transactions on Asian and Low Resource Language Information Processing}, YEAR = {2025}}
Current computational approaches for analysing or generating code-mixed sentences do not explicitly model ``naturalness'' or ``acceptability'' of code-mixed sentences, but rely on training corpora to reflect the distribution of acceptable code-mixed sentences. Modelling human judgement of the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline - a dataset containing human acceptability judgements for English-Hindi~(en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, drawn from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to filter/curate/compare code-mixed corpora, have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics as features are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, among Encoder models, XLM-Roberta and Bernice outperform IndicBERT across different configurations. Among Encoder-Decoder models, mBART performs better than mT5; however, Encoder-Decoder models are not able to outperform Encoder-only models. Decoder-only models perform the best compared to all other MLLMs, with Llama 3.2 - 3B models outperforming similarly sized Qwen and Phi models. Comparison with the zero- and few-shot capabilities of ChatGPT shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from En-Hi to En-Te acceptability judgements performs better than random baselines.
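The code-mixing metrics named in the abstract above (CMI and Number of Switch Points) can be sketched over per-token language tags. This is an illustrative implementation following the commonly cited Code-Mixing Index definition, not code from the paper; the tag values ("en", "hi", "other") and function names are assumptions.

```python
def cmi(tags):
    """Code-Mixing Index: 100 * (1 - max_lang / (n - u)), where n is the
    total token count, u the count of language-independent tokens, and
    max_lang the count of the dominant language's tokens."""
    n = len(tags)
    u = sum(1 for t in tags if t == "other")
    if n == u:  # no language-tagged tokens -> monolingual by convention
        return 0.0
    counts = {}
    for t in tags:
        if t != "other":
            counts[t] = counts.get(t, 0) + 1
    return 100.0 * (1 - max(counts.values()) / (n - u))

def switch_points(tags):
    """Count positions where the language changes between adjacent
    language-tagged (non-'other') tokens."""
    langs = [t for t in tags if t != "other"]
    return sum(1 for a, b in zip(langs, langs[1:]) if a != b)
```

As the abstract notes, such surface statistics are cheap to compute from tagged corpora, which is precisely why their low correlation with human acceptability judgements is notable.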
@inproceedings{bib_Brid_2025, AUTHOR = {Likhith Asapu, Kodali Prashant, Ashna Dua, Kapil Rajesh Kavitha, Manish Shrivastava}, TITLE = {Bridging Laughter Across Languages: Generation of Hindi-English Code-mixed Puns}, BOOKTITLE = {Workshop on Computational Humor}, YEAR = {2025}}
Puns, as a linguistic phenomenon, hold significant importance in both humor and language comprehension. While extensive research has been conducted in the realm of pun generation in English, there exists a notable gap in the exploration of pun generation within code-mixed text, particularly Hindi-English code-mixed text. This study addresses this gap by offering a computational method specifically designed to create puns in Hindi-English code-mixed text. In our investigation, we delve into three distinct methodologies aimed at pun generation utilizing pun-alternate word pairs. Furthermore, we introduce HECoP, a novel dataset comprising 2,000 human-annotated sentences, which serves as a foundational resource for training diverse pun detection models. Additionally, we developed a structured pun generation pipeline capable of generating puns from a single input word without relying on predefined word pairs. Through rigorous human evaluations, our study demonstrates the efficacy of our proposed models in generating code-mixed puns. The findings presented herein lay a solid groundwork for future endeavours in pun generation and computational humor within diverse linguistic contexts.
Srija Mukhopadhyay,Abhishek Rajgaria,Prerana Khatiwada,Manish Shrivastava,Dan Roth,Vivek Gupta
@inproceedings{bib_MAPW_2025, AUTHOR = {Srija Mukhopadhyay, Abhishek Rajgaria, Prerana Khatiwada, Manish Shrivastava, Dan Roth, Vivek Gupta}, TITLE = {MAPWise: Evaluating Vision-Language Models for Advanced Map Queries}, BOOKTITLE = {North American Association for Computational Linguistics}, YEAR = {2025}}
Vision-language models (VLMs) excel at tasks requiring joint understanding of visual and linguistic information. A particularly promising yet under-explored application for these models lies in answering questions based on various kinds of maps. This study investigates the efficacy of VLMs in answering questions based on choropleth maps, which are widely used for data analysis and representation. To facilitate and encourage research in this area, we introduce a novel map-based question-answering benchmark, consisting of maps from three geographical regions (United States, India, China), each containing around 1000 questions. Our benchmark incorporates 43 diverse question templates, requiring nuanced understanding of relative spatial relationships, intricate map features, and complex reasoning. It also includes maps with discrete and continuous values, covering variations in color mapping, category ordering, and stylistic patterns, enabling a comprehensive analysis. We evaluated the performance of multiple VLMs on this benchmark, highlighting gaps in their abilities, and providing insights for improving such models. Our dataset, along with all necessary code scripts, is available at map-wise.github.io
Abhinav S Menon,Manish Shrivastava,David S. Krueger,Ekdeep S. Lubana
@inproceedings{bib_Anal_2025, AUTHOR = {Abhinav S Menon, Manish Shrivastava, David S. Krueger, Ekdeep S. Lubana}, TITLE = {Analyzing (In)Abilities of SAEs via Formal Languages}, BOOKTITLE = {North American Association for Computational Linguistics}, YEAR = {2025}}
Autoencoders have been used for finding interpretable and disentangled features underlying neural network representations in both image and text domains. While the efficacy and pitfalls of such methods are well-studied in vision, there is a lack of corresponding results, both qualitative and quantitative, for the text domain. We aim to address this gap by training sparse autoencoders (SAEs) on a synthetic testbed of formal languages. Specifically, we train SAEs on the hidden representations of models trained on formal languages (Dyck-2, Expr, and English PCFG) under a wide variety of hyperparameter settings, finding that interpretable latents often emerge in the features learned by our SAEs. However, similar to vision, we find that performance turns out to be highly sensitive to inductive biases of the training pipeline. Moreover, we show that latents correlating to certain features of the input do not always induce a causal impact on the model's computation. We thus argue that causality has to become a central target in SAE training: learning of causal features should be incentivized from the ground up. Motivated by this, we propose and perform preliminary investigations for an approach that promotes learning of causally relevant features in our formal language setting.
Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain)
Akshett Rai Jindal,Oota Subba Reddy,Ishani Mondal,Khushbu Pahwa,Satya Sai Srinath Namburi GNVV,Manish Shrivastava,Maneesh Singh,Bapi Raju Surampudi,Manish Gupta
@inproceedings{bib_Corr_2025, AUTHOR = {Akshett Rai Jindal, Oota Subba Reddy, Ishani Mondal, Khushbu Pahwa, Satya Sai Srinath Namburi GNVV, Manish Shrivastava, Maneesh Singh, Bapi Raju Surampudi, Manish Gupta}, TITLE = {Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain)}, BOOKTITLE = {International Conference on Learning Representations}, YEAR = {2025}}
Transformer-based language models, though not explicitly trained to mimic brain recordings, have demonstrated surprising alignment with brain activity. Progress in these models—through increased size, instruction-tuning, and multimodality—has led to better representational alignment with neural data. Recently, a new class of instruction-tuned multimodal LLMs (MLLMs) has emerged, showing remarkable zero-shot capabilities in open-ended multimodal vision tasks. However, it is unknown whether MLLMs, when prompted with natural instructions, lead to better brain alignment and effectively capture instruction-specific representations. To address this, we first investigate brain alignment, i.e., measuring the degree of predictivity of neural visual activity using text output response embeddings from MLLMs as participants engage in watching natural scenes. Experiments with 10 different instructions (like image captioning, visual question answering, etc.) show that MLLMs exhibit significantly better brain alignment than vision-only models and perform comparably to non-instruction-tuned multimodal models like CLIP. We also find that while these MLLMs are effective at generating high-quality responses suitable to the task-specific instructions, not all instructions are relevant for brain alignment. Further, by varying instructions, we make the MLLMs encode instruction-specific visual concepts related to the input image. This analysis shows that MLLMs effectively capture count-related and recognition-related concepts, demonstrating strong alignment with brain activity. Notably, the majority of the explained variance of the brain encoding models is shared between MLLM embeddings of image captioning and other instructions. These results indicate that enhancing MLLMs' ability to capture more task-specific information could allow for better differentiation between various types of instructions, and hence improve their precision in predicting brain responses.
Why should only High-Resource-Languages have all the fun? Pivot Based Evaluation in Low Resource Setting
@inproceedings{bib_Why__2025, AUTHOR = {Ananya Mukherjee, Saumitra Yadav, Manish Shrivastava}, TITLE = {Why should only High-Resource-Languages have all the fun? Pivot Based Evaluation in Low Resource Setting}, BOOKTITLE = {International Conference on Computational Linguistics}, YEAR = {2025}}
Evaluating machine translation (MT) systems for low-resource languages has long been a challenge due to the limited availability of evaluation metrics and resources. As a result, researchers in this space have relied primarily on lexical-based metrics like BLEU, TER, and ChrF, which lack semantic evaluation. In this first-of-its-kind work, we propose a novel pivot-based evaluation framework that addresses these limitations; after translating low-resource language outputs into a related high-resource language, we leverage advanced neural and embedding-based metrics for more meaningful evaluation. Through a series of experiments using five low-resource languages: Assamese, Manipuri, Kannada, Bhojpuri, and Nepali, we demonstrate how this method extends the coverage of both lexical-based and embedding-based metrics, even for languages not directly supported by advanced metrics. Our results show that the differences between direct and pivot-based evaluation scores are minimal, proving that this approach is a viable and effective solution for evaluating translations in endangered and low-resource languages. This work paves the way for more inclusive, accurate, and scalable MT evaluation for underrepresented languages, marking a significant step forward in this under-explored area of research.
@inproceedings{bib_CoST_2024, AUTHOR = {Ananya Mukherjee, Saumitra Yadav, Manish Shrivastava}, TITLE = {CoST of breaking the LLMs}, BOOKTITLE = {Conference on Machine Translation}, YEAR = {2024}}
This paper presents an evaluation of 16 machine translation systems submitted to the Shared Task of the 9th Conference on Machine Translation (WMT24) for the English-Hindi (en-hi) language pair using our Complex Structures Test (CoST) suite. Aligning with this year's test suite sub-task theme, "Help us break LLMs", we curated a comprehensive test suite encompassing diverse datasets across various categories, including autobiography, poetry, legal, conversation, play, narration, technical, and mixed genres. Our evaluation reveals that all the systems struggle significantly with archaic styles of text, such as legal and technical writing, and with creatively twisted text, such as the conversation and poetry datasets, highlighting their weaknesses in handling complex linguistic structures and stylistic nuances inherent in these text types. Our evaluation identifies the strengths and limitations of the submitted models, pointing to specific areas where further research and development are needed to enhance their performance. Our test suite is available at https://github.com/AnanyaCoder/CoST-WMT-24-Test-Suite-Task.
chrF-S: Semantics Is All You Need
Ananya Mukherjee,Manish Shrivastava
Conference on Machine Translation, WMT, 2024
@inproceedings{bib_chrF_2024, AUTHOR = {Ananya Mukherjee, Manish Shrivastava}, TITLE = {chrF-S: Semantics Is All You Need}, BOOKTITLE = {Conference on Machine Translation}, YEAR = {2024}}
Machine translation (MT) evaluation metrics like BLEU and chrF++ are widely used reference-based metrics that do not require training and are language-independent. However, these metrics primarily focus on n-gram matching and often overlook semantic depth and contextual understanding. To address this gap, we introduce chrF-S (Semantic chrF++), an enhanced metric that integrates sentence embeddings to evaluate translation quality more comprehensively. By combining traditional character and word n-gram analysis with semantic information derived from embeddings, chrF-S captures both syntactic accuracy and sentence-level semantics. This paper presents our contributions to the WMT24 shared metrics task, showcasing our participation and the development of chrF-S. We also demonstrate that, according to preliminary results on the leaderboard, our metric performs on par with other supervised and LLM-based metrics. By merging semantic insights with n-gram precision, chrF-S offers a significant enhancement in the assessment of machine-generated translations, advancing the field of MT evaluation. Our code and data will be made available at https://github.com/AnanyaCoder/chrF-S.
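The core idea described above, blending an n-gram surface score with sentence-embedding similarity, can be sketched as a simple interpolation. This is a minimal illustration, not the paper's actual formula: the weighting scheme, the `alpha` parameter, and the helper names are assumptions; the authors' exact combination is in their linked repository.

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def chrf_s(chrf_score, ref_emb, hyp_emb, alpha=0.5):
    """Interpolate a chrF++-style surface score (scaled to [0, 1]) with
    the semantic similarity of reference and hypothesis embeddings."""
    semantic = cosine(ref_emb, hyp_emb)
    return alpha * chrf_score + (1 - alpha) * semantic
```

In practice the embeddings would come from a multilingual sentence encoder, which is what keeps the metric reference-based yet sensitive to meaning-preserving paraphrases that n-gram matching misses.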
A3-108 Controlling Token Generation in Low Resource Machine Translation Systems
Saumitra Yadav,Ananya Mukherjee,Manish Shrivastava
Conference on Machine Translation, WMT, 2024
@inproceedings{bib_A3-1_2024, AUTHOR = {Saumitra Yadav, Ananya Mukherjee, Manish Shrivastava}, TITLE = {A3-108 Controlling Token Generation in Low Resource Machine Translation Systems}, BOOKTITLE = {Conference on Machine Translation}, YEAR = {2024}}
Translating for languages with limited resources poses a persistent challenge due to the scarcity of high-quality training data. To enhance translation accuracy, we explored controlled generation mechanisms, focusing on the importance of control tokens. In our experiments, we encoded the target sentence length as a control token added to the source sentence during training, treating it as an additional source-side feature. We developed various NMT models using the transformer architecture and conducted experiments across 8 language directions (English ↔ Assamese, Manipuri, Khasi, and Mizo), exploring four variations of length encoding mechanisms. Through comparative analysis against the baseline model, we submitted two systems for each language direction. We report our findings in this work.
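The length-as-control-token scheme described above can be sketched as a preprocessing step. This is a hypothetical illustration of one of the four encoding variants the abstract alludes to: the bucket size and the `<len_N>` token format are assumptions, not details from the paper.

```python
def add_length_token(source_tokens, target_tokens, bucket=5):
    """Prepend a bucketed target-length control token (e.g. '<len_10>')
    to the source so the model can condition generation on the expected
    output length seen at training time."""
    bucketed = (len(target_tokens) // bucket) * bucket
    return [f"<len_{bucketed}>"] + source_tokens
```

At inference time the desired length bucket is supplied by the user (or estimated from the source), which is what makes the generation "controlled".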
Knowledge-Aware Reasoning over Multimodal Semi-structured Tables
Suyash Vardhan Mathur,Kunal Kartik,Bafna Jainit Sushil,Harshita Khandelwal,Manish Shrivastava,Vivek Gupta,Mohit Bansal,Dan Roth
Empirical Methods in Natural Language Processing-Findings, EMNLP-F, 2024
@inproceedings{bib_Know_2024, AUTHOR = {Suyash Vardhan Mathur, Kunal Kartik, Bafna Jainit Sushil, Harshita Khandelwal, Manish Shrivastava, Vivek Gupta, Mohit Bansal, Dan Roth}, TITLE = {Knowledge-Aware Reasoning over Multimodal Semi-structured Tables}, BOOKTITLE = {Empirical Methods in Natural Language Processing-Findings}, YEAR = {2024}}
Existing datasets for tabular question answering typically focus exclusively on text within cells. However, real-world data is inherently multimodal, often blending images such as symbols, faces, icons, patterns, and charts with textual content in tables. With the evolution of AI models capable of multimodal reasoning, it is pertinent to assess their efficacy in handling such structured data. This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data. We explore their ability to reason on tables that integrate both images and text, introducing MMTabQA, a new dataset designed for this purpose. Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs, understanding visual context, and comparing visual content across images. These findings establish our dataset as a robust benchmark for advancing AI's comprehension and capabilities in analyzing multimodal structured data.
SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages
Manish Shrivastava,NIRMAL SURANGE
Findings of the Association for Computational Linguistics, FACL, 2024
@inproceedings{bib_SemR_2024, AUTHOR = {Manish Shrivastava, NIRMAL SURANGE}, TITLE = {SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages}, BOOKTITLE = {Findings of the Association for Computational Linguistics}, YEAR = {2024}}
Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia – regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP.
SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages
Nedjma Ousidhoum,Mohamed Abdalla,NIRMAL SURANGE,Oumaima Hourrane,Sanchit Ahuja,Seid Muhie Yimam,Saif M. Mohammad,Shamsuddeen Hassan Muhammad,Alham Fikri Aji,Christine De Kock,Ibrahim Said Ahmad,Idris Abdulmumin,Krishnapriya Vishnubhotla,Manish Shrivastava,Meriem Beloucif
International Workshop on Semantic Evaluation, SemEval, 2024
@inproceedings{bib_SemE_2024, AUTHOR = {Nedjma Ousidhoum, Mohamed Abdalla, NIRMAL SURANGE, Oumaima Hourrane, Sanchit Ahuja, Seid Muhie Yimam, Saif M. Mohammad, Shamsuddeen Hassan Muhammad, Alham Fikri Aji, Christine De Kock, Ibrahim Said Ahmad, Idris Abdulmumin, Krishnapriya Vishnubhotla, Manish Shrivastava, Meriem Beloucif}, TITLE = {SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2024}}
We present the first shared task on Semantic Textual Relatedness (STR). While earlier shared tasks primarily focused on semantic similarity, we instead investigate the broader phenomenon of semantic relatedness across 14 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia – regions characterised by the relatively limited availability of NLP resources. Each instance in the datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. Participating systems were asked to rank sentence pairs by their closeness in meaning (i.e., their degree of semantic relatedness) in the 14 languages in three main tracks: (a) supervised, (b) unsupervised, and (c) crosslingual. The task attracted 163 participants. We received 70 submissions in total (across all tasks) from 51 different teams, and 38 system description papers. We report on the best-performing systems as well as the most common and the most effective approaches for the three different tracks.
Maha Bhaashya at SemEval-2024 Task 6: Zero-Shot Multi-task Hallucination Detection
Malladi Bhaskararama Sahishna Advaith,Patanjali Bhamidipati,Manish Shrivastava,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2024
@inproceedings{bib_Maha_2024, AUTHOR = {Malladi Bhaskararama Sahishna Advaith, Patanjali Bhamidipati, Manish Shrivastava, Radhika Mamidi}, TITLE = {Maha Bhaashya at SemEval-2024 Task 6: Zero-Shot Multi-task Hallucination Detection}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2024}}
In recent studies, the extensive utilization of large language models has underscored the importance of robust evaluation methodologies for assessing text generation quality and relevance to specific tasks. This has revealed a prevalent issue known as hallucination, an emergent condition in the model where generated text lacks faithfulness to the source and deviates from the evaluation criteria. In this study, we formally define hallucination and propose a framework for its quantitative detection in a zero-shot setting, leveraging our definition and the assumption that model outputs entail task and sample specific inputs. In detecting hallucinations, our solution achieves an accuracy of 0.78 in a model-aware setting and 0.61 in a model-agnostic setting. Notably, our solution maintains computational efficiency, requiring far fewer computational resources than other SOTA approaches, aligning with the trend towards lightweight and compressed models.
LastResort at SemEval-2024 Task 3: Exploring Multimodal Emotion Cause Pair Extraction as Sequence Labelling Task
Suyash Vardhan Mathur,Akshett Rai Jindal,Hardik Mittal,Manish Shrivastava
International Workshop on Semantic Evaluation, SemEval, 2024
@inproceedings{bib_Last_2024, AUTHOR = {Suyash Vardhan Mathur, Akshett Rai Jindal, Hardik Mittal, Manish Shrivastava}, TITLE = {LastResort at SemEval-2024 Task 3: Exploring Multimodal Emotion Cause Pair Extraction as Sequence Labelling Task}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2024}}
Conversation is the most natural form of human communication, where each utterance can range over a variety of possible emotions. While significant work has been done towards the detection of emotions in text, relatively little work has been done towards finding the cause of the said emotions, especially in multimodal settings. SemEval 2024 introduces the task of Multimodal Emotion Cause Analysis in Conversations, which aims to extract emotions reflected in individual utterances in a conversation involving multiple modalities (textual, audio, and visual modalities) along with the corresponding utterances that were the cause for the emotion. In this paper, we propose models that tackle this task as an utterance labeling and a sequence labeling problem and perform a comparative study of these models, involving baselines using different encoders, using BiLSTM for adding contextual information of the conversation, and finally adding a CRF layer to model the inter-dependencies between adjacent utterances more effectively. On the official leaderboard for the task, our architecture ranked 8th, achieving an F1-score of 0.1759.
DaVinci at SemEval-2024 Task 9: Few-shot prompting GPT-3.5 for Unconventional Reasoning
Suyash Vardhan Mathur,Akshett Rai Jindal,Manish Shrivastava
International Workshop on Semantic Evaluation, SemEval, 2024
@inproceedings{bib_DaVi_2024, AUTHOR = {Suyash Vardhan Mathur, Akshett Rai Jindal, Manish Shrivastava}, TITLE = {DaVinci at SemEval-2024 Task 9: Few-shot prompting GPT-3.5 for Unconventional Reasoning}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2024}}
While significant work has been done in the field of NLP on vertical thinking, which involves primarily logical thinking, little work has been done towards lateral thinking, which involves looking at problems from an unconventional perspective and defying existing conceptions and notions. Towards this direction, SemEval 2024 introduces the task of BRAINTEASER, which involves two types of questions -- Sentence Puzzles and Word Puzzles that defy conventional common-sense reasoning and constraints. In this paper, we tackle both types of questions using few-shot prompting on GPT-3.5 and gain insights regarding the difference in the nature of the two types. Our prompting strategy placed us 26th on the leaderboard for the Sentence Puzzle and 15th on the Word Puzzle task.
Mast Kalandar at SemEval-2024 Task 8: On the Trail of Textual Origins: RoBERTa-BiLSTM Approach to Detect AI-Generated Text
Bafna Jainit Sushil,Hardik Mittal,Suyash Sethia,Manish Shrivastava,Radhika Mamidi
International Workshop on Semantic Evaluation, SemEval, 2024
@inproceedings{bib_Mast_2024, AUTHOR = {Bafna Jainit Sushil, Hardik Mittal, Suyash Sethia, Manish Shrivastava, Radhika Mamidi}, TITLE = {Mast Kalandar at SemEval-2024 Task 8: On the Trail of Textual Origins: RoBERTa-BiLSTM Approach to Detect AI-Generated Text}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2024}}
Large Language Models (LLMs) have showcased impressive abilities in generating fluent responses to diverse user queries. However, concerns regarding the potential misuse of such texts in journalism, educational, and academic contexts have surfaced. SemEval 2024 introduces the task of Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection, aiming to develop automated systems for identifying machine-generated text and detecting potential misuse. In this paper, we (i) propose a RoBERTa-BiLSTM based classifier designed to classify text into two categories: AI-generated or human, and (ii) conduct a comparative study of our model with baseline approaches to evaluate its effectiveness. This paper contributes to the advancement of automatic text detection systems in addressing the challenges posed by machine-generated text misuse. On the official leaderboard for the task, our architecture was ranked 46th with an accuracy of 0.8083.
TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu
Kanumolu Gopichand,Madasu Lokesh,NIRMAL SURANGE,Manish Shrivastava
International Conference on Computational Linguistics, COLING, 2024
@inproceedings{bib_TeCl_2024, AUTHOR = {Kanumolu Gopichand, Madasu Lokesh, NIRMAL SURANGE, Manish Shrivastava}, TITLE = {TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu}, BOOKTITLE = {International Conference on Computational Linguistics}, YEAR = {2024}}
News headline generation is a crucial task for increasing the productivity of both readers and producers of news, and it can readily be aided by automated headline-generation models. However, the presence of irrelevant headlines in scraped news articles results in sub-optimal performance of generation models. We propose that relevance-based headline classification can greatly aid the task of generating relevant headlines. Relevance-based headline classification involves categorizing news headlines based on their relevance to the corresponding news articles. While this task is well-established in English, it remains under-explored in low-resource languages like Telugu due to a lack of annotated data. To address this gap, we present TeClass, the first-ever human-annotated Telugu news headline classification dataset, containing 78,534 annotations across 26,178 article-headline pairs. We experiment with various baseline models and provide a comprehensive analysis of their results. We further demonstrate the impact of this work by fine-tuning various headline generation models on the TeClass dataset. Headlines generated by models fine-tuned on highly relevant article-headline pairs showed an improvement of about 5 points in ROUGE-L scores. To encourage future research, the annotated dataset as well as the annotation guidelines will be made publicly available.
Hindi Causal TimeBank: an Annotated Causal Event Corpus
Tanvi Kamble,Manish Shrivastava
International Conference on Natural Language Processing., ICON, 2023
@inproceedings{bib_Hind_2023, AUTHOR = {Tanvi Kamble, Manish Shrivastava}, TITLE = {Hindi Causal TimeBank: an Annotated Causal Event Corpus}, BOOKTITLE = {International Conference on Natural Language Processing.}, YEAR = {2023}}
Events and states have gained importance in NLP and information retrieval for being semantically rich temporal and spatial information indicators. Event causality helps us identify which events are necessary for another event to occur. Cause-effect event pairs are relevant for multiple NLP tasks like question answering, summarization, etc. Multiple efforts have been made to identify causal events in documents, but very little work has been done in this field for the Hindi language. We create an annotated corpus for detecting and classifying causal event relations on top of the Hindi TimeBank (Goel et al., 2020), the ‘Hindi Causal TimeBank’ (Hindi CTB). We introduce semantic causal relations such as Purpose, Reason, and Enablement, inspired by Bejan and Harabagiu (2008)’s annotation scheme, and add some special cases particular to the Hindi language.
A Survey of using Large Language Models for Generating Infrastructure as Code
Kalahasti Ganesh Srivatsa,Sabyasachi Muhopadhyay,KATRAPATI GANESH SASANK,Manish Shrivastava
International Conference on Natural Language Processing., ICON, 2023
@inproceedings{bib_A_Su_2023, AUTHOR = {Kalahasti Ganesh Srivatsa, Sabyasachi Muhopadhyay, KATRAPATI GANESH SASANK, Manish Shrivastava}, TITLE = {A Survey of using Large Language Models for Generating Infrastructure as Code}, BOOKTITLE = {International Conference on Natural Language Processing.}, YEAR = {2023}}
Infrastructure as Code (IaC) is a revolutionary approach that has gained significant prominence in industry. IaC manages and provisions IT infrastructure using machine-readable code, enabling automation, consistency across environments, reproducibility, version control, error reduction, and enhanced scalability. However, IaC orchestration is often a painstaking effort that requires specialised skills as well as considerable manual work. Automating IaC is a necessity under present industry conditions, and in this survey we study the feasibility of applying Large Language Models (LLMs) to this problem. LLMs are large neural network-based models that have demonstrated significant language processing abilities and have been shown to follow a range of instructions within a broad scope. Recently, they have also been successfully adapted for code understanding and generation tasks, which makes them a promising choice for the automatic generation of IaC configurations. In this survey, we delve into the details of IaC, the usage of IaC on different platforms and its challenges, LLMs from the perspective of code generation, and the importance of LLMs to IaC, along with our own experiments. Finally, we conclude by presenting the challenges in this area and highlighting the scope for future research.
Mukhyansh: A Headline Generation Dataset for Indic Languages
Madasu Lokesh,Kanumolu Gopichand,NIRMAL SURANGE,Manish Shrivastava
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2023
@inproceedings{bib_Mukh_2023, AUTHOR = {Madasu Lokesh, Kanumolu Gopichand, NIRMAL SURANGE, Manish Shrivastava}, TITLE = {Mukhyansh: A Headline Generation Dataset for Indic Languages}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}, YEAR = {2023}}
The task of headline generation within the realm of Natural Language Processing (NLP) holds immense significance, as it strives to distill the true essence of textual content into concise and attention-grabbing summaries. While noteworthy progress has been made in headline generation for widely spoken languages like English, numerous challenges persist when it comes to generating headlines in low-resource languages, such as the rich and diverse Indian languages. A prominent obstacle that specifically hinders headline generation in Indian languages is the scarcity of high-quality annotated data. To address this crucial gap, we present Mukhyansh, an extensive multilingual dataset tailored for Indian language headline generation. Comprising an impressive collection of over 3.39 million article-headline pairs, Mukhyansh spans eight prominent Indian languages, namely Telugu, Tamil, Kannada, Malayalam, Hindi, Bengali, Marathi, and Gujarati. We present a comprehensive evaluation of several state-of-the-art baseline models. Additionally, through an empirical analysis of existing works, we demonstrate that Mukhyansh outperforms all other models, achieving an impressive average ROUGE-L score of 31.43 across all 8 languages.
LTRC_IIITH’s 2023 Submission for Prompting Large Language Models as Explainable Metrics Task
Baswani Pavan,Ananya Mukherjee,Manish Shrivastava
International Joint Conference on Natural Language Processing Workshop, IJCNLP-W, 2023
@inproceedings{bib_LTRC_2023, AUTHOR = {Baswani Pavan, Ananya Mukherjee, Manish Shrivastava}, TITLE = {LTRC_IIITH’s 2023 Submission for Prompting Large Language Models as Explainable Metrics Task}, BOOKTITLE = {International Joint Conference on Natural Language Processing Workshop}, YEAR = {2023}}
In this report, we share our contribution to the Eval4NLP Shared Task titled "Prompting Large Language Models as Explainable Metrics." We build our prompts with a primary focus on effective prompting strategies, score-aggregation, and explainability for LLM-based metrics. We participated in the track for smaller models by submitting the scores along with their explanations. According to the Kendall correlation scores on the leaderboard, our MT evaluation submission ranks second-best, while our summarization evaluation submission ranks fourth, with only a 0.06 difference from the leading submission.
Fine-grained Contract NER using instruction based model
Hiranmai Sri Adibhatla,Baswani Pavan,Manish Shrivastava
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2023
@inproceedings{bib_Fine_2023, AUTHOR = {Hiranmai Sri Adibhatla, Baswani Pavan, Manish Shrivastava}, TITLE = {Fine-grained Contract NER using instruction based model}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}, YEAR = {2023}}
Lately, instruction-based techniques have made significant strides in improving performance in few-shot learning scenarios. They achieve this by bridging the gap between pre-trained language models and fine-tuning for specific downstream tasks. Despite these advancements, the performance of Large Language Models (LLMs) in information extraction tasks like Named Entity Recognition (NER), using prompts or instructions, still falls short of supervised baselines. This performance gap can be attributed to the fundamental disparity between NER and LLMs. NER is inherently a sequence labeling task, where the model must assign entity-type labels to individual tokens within a sentence. In contrast, LLMs are designed for text generation. This distinction between semantic labeling and text generation leads to subpar performance. In this paper, we transform the NER task into a text-generation task that can be readily adapted by LLMs. This involves enhancing source sentences with task-specific instructions and answer choices, allowing for the identification of entities and their types within natural language. We harness the strength of LLMs by integrating supervised learning within them. The goal of this combined strategy is to boost the performance of LLMs in extraction tasks like NER while simultaneously addressing hallucination issues often observed in LLM-generated content. A novel corpus, Contract NER, comprising seven frequently observed contract categories and encompassing named entities associated with 18 distinct legal entity types, is released along with our baseline models. Our models and dataset are available to the community for future research.
IIIT HYD’s Submission for WMT23 Test-suite task
Ananya Mukherjee,Manish Shrivastava
Conference on Machine Translation, WMT, 2023
@inproceedings{bib_IIIT_2023, AUTHOR = {Ananya Mukherjee, Manish Shrivastava}, TITLE = {IIIT HYD’s Submission for WMT23 Test-suite task}, BOOKTITLE = {Conference on Machine Translation}, YEAR = {2023}}
This paper summarizes the results of our test suite evaluation on 12 machine translation systems submitted at the Shared Task of the 8th Conference on Machine Translation (WMT23) for the English-German (en-de) language pair. Our test suite covers five specific domains (entertainment, environment, health, science, legal) and spans five distinct writing styles (descriptive, judgments, narrative, reporting, technical-writing). We present our analysis through automatic evaluation methods, conducted with a focus on domain-specific and writing-style-specific evaluations.
MEE4 and XLsim : IIIT HYD’s Submissions’ for WMT23 Metrics Shared Task
Ananya Mukherjee,Manish Shrivastava
Conference on Machine Translation, WMT, 2023
@inproceedings{bib_MEE4_2023, AUTHOR = {Ananya Mukherjee, Manish Shrivastava}, TITLE = {MEE4 and XLsim : IIIT HYD’s Submissions’ for WMT23 Metrics Shared Task}, BOOKTITLE = {Conference on Machine Translation}, YEAR = {2023}}
This paper presents our contributions to the WMT2023 shared metrics task, consisting of two distinct evaluation approaches: a) an Unsupervised Metric (MEE4) and b) a Supervised Metric (XLsim). MEE4 is an unsupervised, reference-based assessment metric that quantifies linguistic features, encompassing lexical, syntactic, semantic, morphological, and contextual similarities, leveraging embeddings. In contrast, XLsim is a supervised reference-based evaluation metric employing a Siamese architecture, which regresses on Direct Assessments (DA) from the WMT News Translation shared tasks from 2017-2022. XLsim is trained using XLM-RoBERTa (base) on English-German reference and MT pairs with human scores. Links to the MEE4 and XLsim metrics are provided.
A Computational Algebraic Analysis of Hindi Syntax
Alok Debnath,Manish Shrivastava
Journal of Logic, Language and Information, JLLI, 2023
@inproceedings{bib_A_Co_2023, AUTHOR = {Alok Debnath, Manish Shrivastava}, TITLE = {A Computational Algebraic Analysis of Hindi Syntax}, BOOKTITLE = {Journal of Logic, Language and Information}, YEAR = {2023}}
In this paper, we present a computational algebraic representation of Hindi syntax. This paper is the first attempt to establish the representation of various facets of Hindi syntax into algebra, including dual nominative/ergative behavior, a syntacto-semantic case system and complex agreement rules between the noun and verb phrase. Using the pregroup analysis framework, we show how we represent morphological type reduction for morphological behavior of lexical markers, the representation of causative constructions which are morphologically affixed, as well as of light verb constructions which form the verb by joint predication. We present examples adapted from the Hindi Dependency Treebank to show the pregroup analysis of Hindi sentences.
Mukhyansh: A Headline Generation Dataset for Indic Languages
Madasu Lokesh,Kanumolu Gopichand,NIRMAL SURANGE,Manish Shrivastava
Technical Report, arXiv, 2023
@inproceedings{bib_Mukh_2023, AUTHOR = {Madasu Lokesh, Kanumolu Gopichand, NIRMAL SURANGE, Manish Shrivastava}, TITLE = {Mukhyansh: A Headline Generation Dataset for Indic Languages}, BOOKTITLE = {Technical Report}, YEAR = {2023}}
The task of headline generation within the realm of Natural Language Processing (NLP) holds immense significance, as it strives to distill the true essence of textual content into concise and attention-grabbing summaries. While noteworthy progress has been made in headline generation for widely spoken languages like English, numerous challenges persist when it comes to generating headlines in low-resource languages, such as the rich and diverse Indian languages. A prominent obstacle that specifically hinders headline generation in Indian languages is the scarcity of high-quality annotated data. To address this crucial gap, we present Mukhyansh, an extensive multilingual dataset tailored for Indian language headline generation. Comprising an impressive collection of over 3.39 million article-headline pairs, Mukhyansh spans eight prominent Indian languages, namely Telugu, Tamil, Kannada, Malayalam, Hindi, Bengali, Marathi, and Gujarati. We present a comprehensive evaluation of several state-of-the-art baseline models. Additionally, through an empirical analysis of existing works, we demonstrate that Mukhyansh outperforms all other models, achieving an impressive average ROUGE-L score of 31.43 across all 8 languages.
Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really Need Reference?
Kanumolu Gopichand,Madasu Lokesh,Baswani Pavan,Ananya Mukherjee,Manish Shrivastava
International Joint Conference on Natural Language Processing Workshop, IJCNLP-W, 2023
@inproceedings{bib_Unsu_2023, AUTHOR = {Kanumolu Gopichand, Madasu Lokesh, Baswani Pavan, Ananya Mukherjee, Manish Shrivastava}, TITLE = {Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really Need Reference?}, BOOKTITLE = {International Joint Conference on Natural Language Processing Workshop}, YEAR = {2023}}
Fluency is a crucial goal of all Natural Language Generation (NLG) systems. Widely used automatic evaluation metrics fall short in capturing the fluency of machine-generated text. Assessing the fluency of NLG systems poses a challenge since these models are not limited to simply reusing words from the input but may also generate abstractions. Existing reference-based fluency evaluations, such as word overlap measures, often exhibit weak correlations with human judgments. This paper adapts an existing unsupervised technique for measuring text fluency without the need for any reference. Our approach leverages various word embeddings and trains language models using Recurrent Neural Network (RNN) architectures. We also experiment with other available multilingual Language Models (LMs). To assess the performance of the models, we conduct a comparative analysis across 10 Indic languages, correlating the obtained fluency scores with human judgments.
The WEAVE 2.0 Corpus: Role Labelled Synthetic Chemical Procedures from Patents with Chemical Named Entities
Shubhangi Dutta,Manish Shrivastava,Prabhakar Bhimalapuram
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2023
@inproceedings{bib_The__2023, AUTHOR = {Shubhangi Dutta, Manish Shrivastava, Prabhakar Bhimalapuram}, TITLE = {The WEAVE 2.0 Corpus: Role Labelled Synthetic Chemical Procedures from Patents with Chemical Named Entities}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}, YEAR = {2023}}
Discovering new reaction pathways lies at the heart of drug discovery and chemical experimentation. A huge amount of drug reaction data lies in unannotated patent texts which are not machine readable. Reaction roles are important for analysing chemical pathways and tracing chemicals through them, and while there is a vast body of chemical data available, the unavailability of reaction-role-annotated data is a blocker to effectively deploying deep learning methods for reaction discovery. This paper introduces a new dataset, WEAVE 2.0, obtained from chemical patents, along with full manual annotations of novel chemical reactions with reaction role information. We also provide baseline and state-of-the-art models for chemical entity recognition on our raw dataset. Our dataset and associated models form the foundation of neural understanding of chemical reaction pathways via reaction roles.
MultiFacet: A Multi-Tasking Framework for Speech-to-Sign Language Generation
Mounika K,Shantanu Singh,Manish Shrivastava
International Conference on Multimodal Interaction, ICMI, 2023
@inproceedings{bib_Mult_2023, AUTHOR = {Mounika K, Shantanu Singh, Manish Shrivastava}, TITLE = {MultiFacet: A Multi-Tasking Framework for Speech-to-Sign Language Generation}, BOOKTITLE = {International Conference on Multimodal Interaction}, YEAR = {2023}}
Sign language is a rich form of communication, uniquely conveying meaning through a combination of gestures, facial expressions, and body movements. Existing research in sign language generation has predominantly focused on text-to-sign pose generation, while speech-to-sign pose generation remains relatively underexplored. Speech-to-sign language generation models can facilitate effective communication between the deaf and hearing communities. In this paper, we propose an architecture that utilises prosodic information from speech audio and semantic context from text to generate sign pose sequences. In our approach, we adopt a multi-tasking strategy that involves an additional task of predicting Facial Action Units (FAUs). FAUs capture the intricate facial muscle movements that play a crucial role in conveying specific facial expressions during sign language generation.
X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents
Mehrad Moradshahi,Tianhao Shen,Kalika Bali,Monojit Choudhury,Gaël de Chalendar,Anmol Goel,Kodali Prashant,Ponnurangam Kumaraguru,Manish Shrivastava
Technical Report, arXiv, 2023
@inproceedings{bib_X-Ri_2023, AUTHOR = {Mehrad Moradshahi, Tianhao Shen, Kalika Bali, Monojit Choudhury, Gaël De Chalendar, Anmol Goel, Kodali Prashant, Ponnurangam Kumaraguru, Manish Shrivastava}, TITLE = {X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents}, BOOKTITLE = {Technical Report}, YEAR = {2023}}
Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, and Korean, as well as a code-mixed English-Hindi language. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language, and unlike most prior multilingual work, is an end-to-end dataset for building fully-functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural with
PrecogIIITH@WASSA2023: Emotion Detection for Urdu-English Code-mixed Text
Vedula Bhaskara Hanuma,Kodali Prashant,Manish Shrivastava,Ponnurangam Kumaraguru
Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA, 2023
@inproceedings{bib_Prec_2023, AUTHOR = {Vedula Bhaskara Hanuma, Kodali Prashant, Manish Shrivastava, Ponnurangam Kumaraguru}, TITLE = {PrecogIIITH@WASSA2023: Emotion Detection for Urdu-English Code-mixed Text}, BOOKTITLE = {Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis}, YEAR = {2023}}
Code-mixing refers to the phenomenon of using two or more languages interchangeably within a speech or discourse context. This practice is particularly prevalent on social media platforms, and determining the embedded affects in a code-mixed sentence remains a challenging problem. In this submission we describe our system for the WASSA 2023 Shared Task on Emotion Detection in English-Urdu code-mixed text. Our system implements a multi-class emotion detection model with a label space of 11 emotions. Samples are code-mixed English-Urdu text, where the Urdu is written in romanised form. Our submission is limited to one of the subtasks, Multi-Class classification, and we leverage transformer-based Multilingual Large Language Models (MLLMs), XLM-RoBERTa and Indic-BERT. We fine-tune the MLLMs on the released data splits, with and without pre-processing steps (translation to English), to classify texts into the appropriate emotion category. Our methods did not surpass the baseline, and our submission is ranked sixth overall.
LTRC at SemEval-2023 Task 6: Experiments with Ensemble Embeddings
Baswani Pavan,Hiranmai Sri Adibhatla,Manish Shrivastava
International Workshop on Semantic Evaluation, SemEval, 2023
@inproceedings{bib_LTRC_2023, AUTHOR = {Baswani Pavan, Hiranmai Sri Adibhatla, Manish Shrivastava}, TITLE = {LTRC at SemEval-2023 Task 6: Experiments with Ensemble Embeddings}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2023}}
In this paper, we present our team’s participation in Task 6: LegalEval: Understanding Legal Texts. The task comprised three subtasks, and we focus on subtask A: Rhetorical Roles prediction. Our approach included experimenting with pre-trained embeddings and refining them with statistical and neural classifiers. We provide a thorough examination of our experiments, solutions, and analysis, culminating in our best-performing model and current progress. We achieved a micro F1 score of 0.6133 on the test data using fine-tuned LegalBERT embeddings.
PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India
Urlana Ashok,Pinzhen Chen,Zheng Zhao,Shay B. Cohen,Manish Shrivastava,Barry Haddow
Technical Report, arXiv, 2023
@inproceedings{bib_PMIn_2023, AUTHOR = {Urlana Ashok, Pinzhen Chen, Zheng Zhao, Shay B. Cohen, Manish Shrivastava, Barry Haddow}, TITLE = {PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India}, BOOKTITLE = {Technical Report}, YEAR = {2023}}
This paper introduces PMIndiaSum, a new multilingual and massively parallel headline summarization corpus focused on languages in India. Our corpus covers four language families and 14 languages, and with 196 language pairs it is the largest to date. It provides a testing ground for all cross-lingual pairs. We detail our workflow to construct the corpus, including data acquisition, processing, and quality assurance. Furthermore, we publish benchmarks for monolingual, cross-lingual, and multilingual summarization by fine-tuning, prompting, as well as translate-and-summarize. Experimental results confirm the crucial role of our data in aiding the summarization of Indian texts. Our dataset is publicly available and can be freely modified and re-distributed.
Attention at SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS)
Debashish Roy,Manish Shrivastava
International Workshop on Semantic Evaluation, SemEval, 2023
@inproceedings{bib_Atte_2023, AUTHOR = {Debashish Roy, Manish Shrivastava}, TITLE = {Attention at SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS)}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2023}}
In this paper, we have worked on the interpretability, trust, and understanding of the decisions made by models on classification tasks. The task is divided into 3 subtasks. The first subtask consists of Binary Sexism Detection. The second describes the Category of Sexism. The third describes a more Fine-grained Category of Sexism. Our work explores solving these tasks as classification problems by fine-tuning transformer-based architectures. We performed several experiments with our architecture, including combining multiple transformers, using domain-adaptive pretraining on the unlabelled dataset provided from Reddit and Gab, joint learning, and taking different layers of transformers as input to a classification head. Our system (with team name ‘Attention’) achieved a macro F1 score of 0.839 for task A, 0.5835 for task B, and 0.3356 for task C in the CodaLab SemEval competition. We later improved task B to 0.6228 and task C to 0.3693 on the test set.
TOURISMNLG: A Multi-lingual Generative Benchmark for the Tourism Domain
Sahil Manoj Bhatt,Sahaj Agarwal,Omkar Gurjar,Manish Gupta,Manish Shrivastava
European Conference on Information Retrieval, ECIR, 2023
@inproceedings{bib_TOUR_2023, AUTHOR = {Sahil Manoj Bhatt, Sahaj Agarwal, Omkar Gurjar, Manish Gupta, Manish Shrivastava}, TITLE = {TOURISMNLG: A Multi-lingual Generative Benchmark for the Tourism Domain}, BOOKTITLE = {European Conference on Information Retrieval}, YEAR = {2023}}
The tourism industry is important both for the benefits it brings and for its role as a commercial activity that creates demand and growth for many other industries. Yet there is not much work on data science problems in tourism. Unfortunately, there is not even a standard benchmark for the evaluation of tourism-specific data science tasks and models. In this paper, we propose TOURISMNLG, a benchmark of five natural language generation (NLG) tasks for the tourism domain, and release corresponding datasets with standard train, validation, and test splits. Further, previously proposed data science solutions for tourism problems do not leverage the recent benefits of transfer learning. Hence, we also contribute the first rigorously pretrained mT5 and mBART model checkpoints for the tourism domain. The models have been pretrained on four tourism-specific datasets covering different aspects of tourism. Using these models, we present initial baseline results on the benchmark tasks. We hope that the dataset will promote active research on natural language generation for travel and tourism.
Indian Language Summarization using Pretrained Sequence-to-Sequence Models
Urlana Ashok,Sahil Manoj Bhatt,NIRMAL SURANGE,Manish Shrivastava
Forum for Information Retrieval Evaluation, FIRE, 2023
@inproceedings{bib_Indi_2023, AUTHOR = {Urlana Ashok, Sahil Manoj Bhatt, NIRMAL SURANGE, Manish Shrivastava}, TITLE = {Indian Language Summarization using Pretrained Sequence-to-Sequence Models}, BOOKTITLE = {Forum for Information Retrieval Evaluation}, YEAR = {2023}}
The ILSUM shared task focuses on text summarization for two major Indian languages, Hindi and Gujarati, along with English. In this task, we experiment with various pretrained sequence-to-sequence models to find the best model for each of the languages. We present a detailed overview of the models and our approaches in this paper. We secure the first rank across all three sub-tasks (English, Hindi and Gujarati). This paper also extensively analyzes the impact of k-fold cross-validation while experimenting with limited data size, and we perform various experiments with combinations of the original and a filtered version of the data to determine the efficacy of the pretrained models.
BRR-QA: Boosting Ranking and Reading in Open-Domain Question Answering
Manish Kumar Singh,Manish Shrivastava
Joint International Conference on Data Science & Management of Data, CODS-COMAD, 2023
@inproceedings{bib_BRR-_2023, AUTHOR = {Manish Kumar Singh, Manish Shrivastava}, TITLE = {BRR-QA: Boosting Ranking and Reading in Open-Domain Question Answering}, BOOKTITLE = {Joint International Conference on Data Science & Management of Data}, YEAR = {2023}}
Open-domain question answering (OpenQA) involves a retriever for selecting relevant passages from large text corpora (e.g. Wikipedia) and a reading comprehension (RC) model for extracting answers from the retrieved passages. The retrieved passages are often noisy. Since OpenQA relies heavily on efficient passages for better answer prediction, many passage ranker models have been proposed to filter out noisy passages. However, their performance is limited because their ranker models score each passage separately, modelling only the relationship between query and passage; thus, they cannot capture local context information. Their ranker models also ignore the rich initial ranking of passages produced by a search engine. This paper presents a Passage Ranker model that captures local context information through cross-passage interaction. Our ranker model integrates the initial ranking and uses modified attention in the cross-passage interaction to compute a better confidence score for each passage. Moreover, we integrate SRL into our passage reader and train it on the proposed sampled data. Our semantic reader can absorb contextual semantics. Experimental results on four public OpenQA datasets show that o
Neural Network Architecture for Credibility Assessment of Textual Claims
RAJAT SINGH,NURENDRA CHOUDHARY,ISHITA BINDLISH,Manish Srivastava
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2023
@inproceedings{bib_Neur_2023, AUTHOR = {RAJAT SINGH, NURENDRA CHOUDHARY, ISHITA BINDLISH, Manish Srivastava}, TITLE = {Neural Network Architecture for Credibility Assessment of Textual Claims}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}, YEAR = {2023}}
Text articles with false claims, especially news, have recently become a growing problem for Internet users. These articles are in wide circulation, and readers face difficulty discerning fact from fiction. Previous work on credibility assessment has focused on factual analysis and linguistic features. The task's main challenge is distinguishing the features of true and false articles. In this paper, we propose a novel approach called Credibility Outcome (CREDO), which aims at scoring the credibility of an article in an open-domain setting. CREDO consists of different modules for capturing various features responsible for the credibility of an article. These features include the credibility of the article's source and author, the semantic similarity between the article and related credible articles retrieved from a knowledge base, and the sentiments conveyed by the article. A neural network architecture learns the contribution of each of these modules to the overall credibility of an article. Experiments on the Snopes dataset reveal that CREDO outperforms state-of-the-art approaches based on linguistic features.
Sentiment analysis of code-mixed languages leveraging resource rich languages
NURENDRA CHOUDHARY,RAJAT SINGH,ISHITA BINDLISH,Manish Srivastava
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2023
@inproceedings{bib_Sent_2023, AUTHOR = {NURENDRA CHOUDHARY, RAJAT SINGH, ISHITA BINDLISH, Manish Srivastava}, TITLE = {Sentiment analysis of code-mixed languages leveraging resource rich languages}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2023}}
Code-mixed data is an important challenge for natural language processing because its characteristics vary completely from the traditional structures of standard languages. In this paper, we propose a novel approach called Sentiment Analysis of Code-Mixed Text (SACMT) to classify sentences into their corresponding sentiment: positive, negative, or neutral, using contrastive learning. We utilize the shared parameters of siamese networks to map sentences from code-mixed and standard languages to a common sentiment space. We also introduce a basic clustering-based preprocessing method to capture variations of code-mixed transliterated words. Our experiments reveal that SACMT outperforms state-of-the-art approaches to sentiment analysis for code-mixed text by 7.6% in accuracy and 10.1% in F-score.
Automatic normalization of word variations in code-mixed social media text
RAJAT SINGH,NURENDRA CHOUDHARY,Manish Srivastava
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2023
@inproceedings{bib_Auto_2023, AUTHOR = {RAJAT SINGH, NURENDRA CHOUDHARY, Manish Srivastava}, TITLE = {Automatic normalization of word variations in code-mixed social media text}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2023}}
Social media platforms such as Twitter and Facebook are becoming popular in multilingual societies. This trend induces a blending of South Asian languages with English. The resulting code-mixed data has recently become popular in research communities for various NLP tasks. Code-mixed data contains anomalies such as grammatical errors and spelling variations. In this paper, we leverage the contextual property of words, whereby different spelling variations of a word share a similar context in large, noisy social media text. We capture different variations of words belonging to the same context in an unsupervised manner using distributed representations of words. Our experiments reveal that preprocessing a code-mixed dataset with our approach improves performance on state-of-the-art part-of-speech tagging (POS-tagging) and sentiment analysis tasks.
Contrastive learning of emoji-based representations for resource-poor languages
NURENDRA CHOUDHARY,RAJAT SINGH,ISHITA BINDLISH,Manish Srivastava
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2023
@inproceedings{bib_Cont_2023, AUTHOR = {NURENDRA CHOUDHARY, RAJAT SINGH, ISHITA BINDLISH, Manish Srivastava}, TITLE = {Contrastive learning of emoji-based representations for resource-poor languages}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2023}}
The introduction of emojis (or emoticons) on social media platforms has given users an increased potential for expression. We propose a novel method called Classification of Emojis using Siamese Network Architecture (CESNA) to learn emoji-based representations of resource-poor languages by jointly training them with resource-rich languages using a siamese network. The CESNA model consists of twin Bi-directional Long Short-Term Memory Recurrent Neural Networks (Bi-LSTM RNNs) with shared parameters, joined by a contrastive loss function based on a similarity metric. The model learns representations of a resource-poor and a resource-rich language in a common emoji space using a similarity metric based on the emojis present in sentences from both languages. The model hence projects sentences with similar emojis closer to each other and sentences with different emojis farther from one another. Experiments on large-scale Twitter datasets of resource-rich languages (English and Spanish) and resource-poor languages (Hindi and Telugu) reveal that CESNA outperforms state-of-the-art emoji prediction approaches based on distributional semantics, semantic rules, lexicon lists, and deep neural network representations without shared parameters.
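The contrastive objective described in the CESNA abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names and the emoji-overlap labelling rule below are our own simplifications, and the embeddings are assumed to come from the twin Bi-LSTM encoders.

```python
import numpy as np

def emoji_label(emojis_a, emojis_b):
    # Treat a sentence pair as "similar" if the sentences share any emoji
    # (a simplified stand-in for the paper's emoji-based similarity metric).
    return int(bool(set(emojis_a) & set(emojis_b)))

def contrastive_loss(e1, e2, similar, margin=1.0):
    """Contrastive loss on a pair of sentence embeddings.

    Similar pairs (similar=1) are pulled together; dissimilar pairs
    (similar=0) are pushed at least `margin` apart in the shared space.
    """
    d = np.linalg.norm(e1 - e2)  # Euclidean distance in the common emoji space
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

In training, this loss would be summed over sampled pairs drawn from both the resource-rich and resource-poor corpora, with the shared encoder parameters updated jointly.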
LTRC @ Causal News Corpus 2022: Extracting and Identifying Causal Elements using Adapters
Hiranmai Sri Adibhatla,Manish Shrivastava
EMNLP Workshop, EMNLP-W, 2022
@inproceedings{bib_LTRC_2022, AUTHOR = {Hiranmai Sri Adibhatla, Manish Shrivastava}, TITLE = {LTRC @ Causal News Corpus 2022: Extracting and Identifying Causal Elements using Adapters}, BOOKTITLE = {EMNLP Workshop}. YEAR = {2022}}
Causality detection and identification is centered on identifying semantic and cognitive connections in a sentence. In this paper, we describe the effort of team LTRC for the Causal News Corpus - Event Causality Shared Task 2022 at the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2022) (Tan et al., 2022a). The shared task consisted of two subtasks: 1) identifying if a sentence contains a causality relation, and 2) identifying spans of text that correspond to cause, effect, and signals. We fine-tuned transformer-based models with adapters for both subtasks. Our best-performing models obtained a binary F1 score of 0.853 on held-out data for subtask 1 and a macro F1 score of 0.032 on held-out data for subtask 2. Our approach ranked third in subtask 1 and fourth in subtask 2. The paper describes our experiments, solutions, and analysis in detail.
SConE: Contextual Relevance based Significant CompoNent Extraction from Contracts
Hiranmai Sri Adibhatla,Manish Shrivastava
International Conference on Natural Language Processing., ICON, 2022
@inproceedings{bib_SCon_2022, AUTHOR = {Hiranmai Sri Adibhatla, Manish Shrivastava}, TITLE = {SConE: Contextual Relevance based Significant CompoNent Extraction from Contracts}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2022}}
Automatic extraction of “significant” components of a legal contract has the potential to simplify the end user’s comprehension. In essence, “significant” pieces of information include 1) information pertaining to material/practical details about a specific contract and 2) information that is novel or comes as a “surprise” for a specific type of contract. This indicates that the significance of a component may be defined both at an individual contract level and at a contract-type level. A component, sentence, or paragraph may be considered significant at a contract level if it contains contract-specific information (CSI), like names, dates, or currency terms. At a contract-type level, components that deviate significantly from the norm for the type may be considered significant (type-specific information (TSI)). In this paper, we present approaches to extract “significant” components from a contract at both these levels. We attempt to do this by identifying patterns in a pool of documents of the same kind. Consequently, in our approach, the solution is formulated in two parts: identifying CSI using a BERT-based contract-specific information extractor, and identifying TSI by scoring sentences in a contract for their likelihood. We also describe the annotated corpus of contract documents that we created as a first step toward the development of such a language-processing system, and we release a dataset of contract samples containing sentences belonging to CSI and TSI.
Named Entity Recognition for Code-Mixed Kannada-English Social Media Data
SAI LAKSHMI POOJITHA NANDIGAM,APPIDI ABHINAV REDDY,Manish Shrivastava
International Conference on Natural Language Processing., ICON, 2022
@inproceedings{bib_Name_2022, AUTHOR = {SAI LAKSHMI POOJITHA NANDIGAM, APPIDI ABHINAV REDDY, Manish Shrivastava}, TITLE = {Named Entity Recognition for Code-Mixed Kannada-English Social Media Data}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2022}}
Named Entity Recognition (NER) is a critical task in the field of Natural Language Processing (NLP) and is also a sub-task of Information Extraction. There has been a significant amount of work on entity extraction and Named Entity Recognition for resource-rich languages. Entity extraction from code-mixed social media data, like tweets from Twitter, complicates the problem due to the unstructured, informal, and incomplete information available in tweets. Here, we present work on NER for a Kannada-English code-mixed social media corpus with corresponding named entity tags for Organisation (Org), Person (Pers), and Location (Loc). We experimented with machine learning classification models such as Conditional Random Fields (CRF), Bi-LSTM, and Bi-LSTM-CRF on our corpus.
Diverse Multi-Answer Retrieval with Determinantal Point Processes
SAI LAKSHMI POOJITHA NANDIGAM,Nikhil Rayaprolu,Manish Shrivastava
International Conference on Computational Linguistics, COLING, 2022
@inproceedings{bib_Dive_2022, AUTHOR = {SAI LAKSHMI POOJITHA NANDIGAM, Nikhil Rayaprolu, Manish Shrivastava}, TITLE = {Diverse Multi-Answer Retrieval with Determinantal Point Processes}, BOOKTITLE = {International Conference on Computational Linguistics}. YEAR = {2022}}
Questions provided to open-domain question answering systems are often ambiguous. Traditional QA systems that provide a single answer are incapable of answering ambiguous questions, since the question may be interpreted in several ways and may have multiple distinct answers. In this paper, we address multi-answer retrieval, which entails retrieving passages that can capture the majority of the diverse answers to the question. We propose a re-ranking-based approach using Determinantal Point Processes (DPPs) with BERT-based kernels. Our method jointly considers query-passage relevance and passage-passage correlation to retrieve passages that are both query-relevant and diverse. Results demonstrate that our re-ranking technique outperforms the state-of-the-art method on the AmbigQA dataset.
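The DPP re-ranking idea in the abstract can be sketched as greedy MAP inference over a kernel that multiplies per-passage relevance into a passage-passage similarity matrix. This is an illustrative sketch, not the paper's system: the function name and the greedy log-determinant selection are standard DPP machinery that we assume here, with toy scores standing in for BERT outputs.

```python
import numpy as np

def greedy_dpp_rerank(relevance, similarity, k):
    """Greedily select k diverse, relevant passages under a DPP kernel.

    Kernel: L[i, j] = relevance[i] * similarity[i, j] * relevance[j],
    so the determinant of a selected submatrix rewards high relevance
    and penalizes redundancy among the chosen passages.
    """
    q = np.asarray(relevance, dtype=float)
    L = q[:, None] * np.asarray(similarity, dtype=float) * q[None, :]
    n = len(q)
    selected = []
    for _ in range(min(k, n)):
        best, best_gain = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            # Log-determinant of the candidate submatrix: larger means
            # the candidate adds relevance without duplicating content.
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_gain:
                best, best_gain = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected
```

With two near-duplicate highly relevant passages and one distinct but less relevant passage, the greedy step picks one duplicate and then the distinct passage, illustrating the relevance-diversity trade-off.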
HashSet - A Dataset For Hashtag Segmentation
Kodali Prashant,Akshala Bhatnagar,Naman Ahuja,Manish Shrivastava,Ponnurangam Kumaraguru
International Conference on Language Resources and Evaluation, LREC, 2022
@inproceedings{bib_Hash_2022, AUTHOR = {Kodali Prashant, Akshala Bhatnagar, Naman Ahuja, Manish Shrivastava, Ponnurangam Kumaraguru}, TITLE = {HashSet - A Dataset For Hashtag Segmentation}, BOOKTITLE = {International Conference on Language Resources and Evaluation}. YEAR = {2022}}
Hashtag segmentation is the task of breaking a hashtag into its constituent tokens. Hashtags often encode the essence of user-generated posts, along with information such as topic and sentiment, which are useful in downstream tasks. Hashtags prioritize brevity and are written in unique ways: transliterating and mixing languages, spelling variations, and creative named entities. The benchmark datasets used for the hashtag segmentation task, STAN and BOUN, are small in size and extracted from a single set of tweets. However, datasets should reflect the variations in writing styles of hashtags and also account for domain and language specificity, failing which the results will misrepresent model performance. We argue that model performance should be assessed on a wider variety of hashtags, and that datasets should be carefully curated. To this end, we propose HashSet, a dataset comprising (a) a 1.9k manually annotated set and (b) a 3.3M loosely supervised set. The HashSet dataset is sampled from a different set of tweets than existing datasets and provides an alternate distribution of hashtags with which to build and validate hashtag segmentation models. We show that the performance of SOTA models for hashtag segmentation drops substantially on the proposed dataset, indicating that it provides an alternate set of hashtags to train and assess models. Datasets and results are released publicly and can be accessed from https://github.com/prashantkodali/HashSet
SyMCoM - Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing
Kodali Prashant,Anmol Goel,Monojit Choudhury,Manish Shrivastava,Ponnurangam Kumaraguru
Findings of the Association for Computational Linguistics, FACL, 2022
@inproceedings{bib_SyMC_2022, AUTHOR = {Kodali Prashant, Anmol Goel, Monojit Choudhury, Manish Shrivastava, Ponnurangam Kumaraguru}, TITLE = {SyMCoM - Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing}, BOOKTITLE = {Findings of the Association for Computational Linguistics}. YEAR = {2022}}
Code mixing is the linguistic phenomenon where bilingual speakers switch between two or more languages in conversation. Recent work on code-mixing in computational settings has leveraged social media code-mixed texts to train NLP models. For capturing the variety of code mixing within and across corpora, Language ID (LID) tag-based measures such as the Code-Mixing Index (CMI) have been proposed. The syntactic variety and patterns of code-mixing, and their relationship to a computational model’s performance, remain underexplored. In this work, we investigate a collection of English (en)-Hindi (hi) code-mixed datasets through a syntactic lens to propose SyMCoM, an indicator of syntactic variety in code-mixed text with intuitive theoretical bounds. We train a SoTA en-hi PoS tagger (93.4% accuracy) to reliably compute PoS tags on a corpus, demonstrate the utility of SyMCoM by applying it to various syntactic categories across a collection of datasets, and compare the datasets using the measure.
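A per-category indicator of the kind the abstract describes can be sketched as a signed ratio of language counts within one PoS category, which is naturally bounded in [-1, 1]. This is our illustrative reading, not the paper's exact definition: the function name, the (tag, language) input format, and the tie-breaking for empty categories are assumptions.

```python
from collections import Counter

def symcom_category(tokens, category):
    """Signed language-dominance ratio for one PoS category.

    tokens: list of (pos_tag, lang_id) pairs, lang_id in {"en", "hi"}.
    Returns a value in [-1, 1]: +1 if the category is realised purely in
    English, -1 if purely in Hindi, 0 if perfectly balanced or absent.
    """
    counts = Counter(lang for pos, lang in tokens if pos == category)
    total = counts["en"] + counts["hi"]
    if total == 0:
        return 0.0
    return (counts["en"] - counts["hi"]) / total
```

Averaging the absolute values of such per-category scores over a corpus would give a single corpus-level summary, analogous to how LID-based measures like CMI aggregate token-level counts.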
TeQuAD: Telugu Question Answering Dataset
Rakesh Kumar Vemula,Mani Kanta Sai Nuthi,Manish Shrivastava
International Conference on Natural Language Processing., ICON, 2022
@inproceedings{bib_TeQu_2022, AUTHOR = {Rakesh Kumar Vemula, Mani Kanta Sai Nuthi, Manish Shrivastava}, TITLE = {TeQuAD: Telugu Question Answering Dataset}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2022}}
Recent state-of-the-art models and new datasets have advanced many Natural Language Processing areas; in particular, Machine Reading Comprehension (MRC) tasks have improved with the help of datasets like SQuAD (the Stanford Question Answering Dataset). But large, high-quality datasets are still not a reality for low-resource languages like Telugu, making it hard to record progress in MRC. In this paper, we present a Telugu Question Answering Dataset, TeQuAD, comprising 82k parallel triples created by translating triples from SQuAD. We also introduce a few methods to create similar question answering datasets for low-resource languages. We then present the performance of our models, which outperform baseline models in monolingual and Cross-Lingual Machine Reading Comprehension (CLMRC) setups, the best of them resulting in an F1 score of 83% and an Exact Match (EM) score of 61%.
Framework for Recasting Table-to-Text Generation Data for Tabular Inference
Aashna Jena,Vivek Gupta,Manish Shrivastava,Julian Martin Eisenschlos
Empirical Methods in Natural Language Processing-Findings, EMNLP-F, 2022
@inproceedings{bib_Fram_2022, AUTHOR = {Aashna Jena, Vivek Gupta, Manish Shrivastava, Julian Martin Eisenschlos}, TITLE = {Framework for Recasting Table-to-Text Generation Data for Tabular Inference}, BOOKTITLE = {Empirical Methods in Natural Language Processing-Findings}. YEAR = {2022}}
Prior work on constructing challenging tabular inference data has centered primarily on human annotation or automatic synthetic generation. Both techniques have their own issues. Human annotation, despite its diversity and superior reasoning, suffers from scaling concerns. Synthetic data, on the other hand, despite its scalability, suffers from a lack of linguistic and reasoning diversity. In this paper, we address both of these concerns by presenting a recasting approach that semi-automatically generates tabular NLI instances. We transform the table2text dataset ToTTo (Parikh et al., 2020) into a tabular NLI dataset using our proposed framework. We demonstrate the use of our recast data as an evaluation benchmark as well as augmentation data to improve performance on TabFact (Chen et al., 2020b). Furthermore, we test the effectiveness of models trained on our data on the TabFact benchmark in the zero-shot scenario.
REUSE: REference-free UnSupervised quality Estimation Metric
Ananya Mukherjee,Manish Shrivastava
Conference on Machine Translation, WMT, 2022
@inproceedings{bib_REUS_2022, AUTHOR = {Ananya Mukherjee, Manish Shrivastava}, TITLE = {REUSE: REference-free UnSupervised quality Estimation Metric}, BOOKTITLE = {Conference on Machine Translation}. YEAR = {2022}}
This paper describes our submission to the WMT2022 shared metrics task. Our unsupervised metric estimates translation quality at the chunk level and the sentence level. Source and target sentence chunks are retrieved using a multilingual chunker. Chunk-level similarity is computed by leveraging BERT contextual word embeddings, and sentence similarity scores are calculated by leveraging sentence embeddings from Language-Agnostic BERT models. The final quality estimation score is obtained by mean-pooling the chunk-level and sentence-level similarity scores. This paper outlines our experiments and also reports the correlation with human judgements for the en-de, en-ru, and zh-en language pairs of the WMT17, WMT18, and WMT19 test sets. Our submission is available at https://github.com/AnanyaCoder/WMT22Submission_REUSE
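The mean-pooling step described in the abstract can be sketched with precomputed embeddings. This is an illustrative sketch under our own assumptions: the function names are ours, the chunk alignment (matching each source chunk to its most similar target chunk) is a simplification, and the toy vectors stand in for BERT and LaBSE embeddings.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def reuse_score(src_chunk_embs, tgt_chunk_embs, src_sent_emb, tgt_sent_emb):
    """Reference-free QE score: mean-pool chunk-level and sentence-level similarity.

    Chunk level: each source chunk is scored against its best-matching
    target chunk. Sentence level: cosine between sentence embeddings.
    """
    chunk_sims = [max(cosine(s, t) for t in tgt_chunk_embs)
                  for s in src_chunk_embs]
    chunk_score = float(np.mean(chunk_sims))
    sent_score = cosine(src_sent_emb, tgt_sent_emb)
    # Final score is the mean of the two granularities.
    return (chunk_score + sent_score) / 2.0
```

Because both components are cosine similarities, the resulting score is reference-free: only the source sentence and the MT output are needed.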
Unsupervised Embedding-based Metric for MT Evaluation with Improved Human Correlation
Ananya Mukherjee,Manish Shrivastava
Conference on Machine Translation, WMT, 2022
@inproceedings{bib_Unsu_2022, AUTHOR = {Ananya Mukherjee, Manish Shrivastava}, TITLE = {Unsupervised Embedding-based Metric for MT Evaluation with Improved Human Correlation}, BOOKTITLE = {Conference on Machine Translation}. YEAR = {2022}}
In this paper, we describe our submission to the WMT22 metrics shared task. Our metric focuses on computing contextual and syntactic equivalences along with lexical, morphological, and semantic similarity. The intent is to capture the fluency and context of the MT outputs along with their adequacy. Fluency is captured using syntactic similarity, and context is captured using sentence similarity leveraging sentence embeddings. The final sentence translation score is the weighted combination of three similarity scores: a) syntactic similarity; b) lexical, morphological, and semantic similarity; and c) contextual similarity. This paper outlines two improved versions of MEE, i.e., MEE2 and MEE4. Additionally, we perform our experiments on the en-de, en-ru, and zh-en language pairs from the WMT17-19 test sets and report the correlation with human assessments. Our submission is available at https://github.com/AnanyaCoder/WMT22Submission.
Towards fine-grained classification of climate change related social media text
Roopal Vaid,Kartikey Pant,Manish Shrivastava
Association for Computational Linguistics: Student Research Workshop, ACL - W, 2022
@inproceedings{bib_Towa_2022, AUTHOR = {Roopal Vaid, Kartikey Pant, Manish Shrivastava}, TITLE = {Towards fine-grained classification of climate change related social media text}, BOOKTITLE = {Association for Computational Linguistics: Student Research Workshop}. YEAR = {2022}}
With climate change becoming a cause of concern worldwide, it becomes essential to gauge people’s reactions. This can help educate and spread awareness about it and help leaders improve decision-making. This work explores fine-grained classification and stance detection for climate change-related social media text. Firstly, we create two datasets, ClimateStance and ClimateEng, consisting of 3777 tweets each, posted during the 2019 United Nations Framework Convention on Climate Change, and comprehensively outline the dataset collection, annotation methodology, and dataset composition. Secondly, we propose the task of climate change stance detection based on our proposed ClimateStance dataset. Thirdly, we propose a fine-grained classification based on the ClimateEng dataset, classifying social media text into five categories: Disaster, Ocean/Water, Agriculture/Forestry, Politics, and General. We benchmark both datasets for climate change stance detection and fine-grained classification using state-of-the-art methods in text classification. We also create a Reddit-based dataset for both tasks, ClimateReddit, consisting of 6262 pseudo-labeled comments along with 329 manually annotated comments. We then perform semi-supervised experiments for both tasks and benchmark their results using the best-performing model from the supervised experiments. Lastly, we provide insights into ClimateStance and ClimateReddit using part-of-speech tagging and named-entity recognition.
TeSum: Human-Generated Abstractive Summarization Corpus for Telugu
Urlana Ashok,NIRMAL SURANGE,Baswani Pavan,Ravva Priyanka,Manish Shrivastava
International Conference on Language Resources and Evaluation, LREC, 2022
@inproceedings{bib_TeSu_2022, AUTHOR = {Urlana Ashok, NIRMAL SURANGE, Baswani Pavan, Ravva Priyanka, Manish Shrivastava}, TITLE = {TeSum: Human-Generated Abstractive Summarization Corpus for Telugu}, BOOKTITLE = {International Conference on Language Resources and Evaluation}. YEAR = {2022}}
Expert human annotation for summarization is an expensive task and cannot be done at huge scales. With this work, we show that even with a crowd-sourced summary generation approach, quality can be controlled by aggressive expert-informed filtering and sampling-based human evaluation. We propose a pipeline that crowd-sources summarization data and then aggressively filters the content via automatic and partial expert evaluation. Using this pipeline we create a high-quality Telugu abstractive summarization dataset (TeSum), which we validate with sampling-based human evaluation. We also provide baseline numbers for various models commonly used for summarization. A number of recently released summarization datasets scraped web content relying on the assumption that a summary is made available with the article by the publishers. While this assumption holds for multiple resources (or news sites) in English, it should not be generalised across languages without thorough analysis and verification. Our analysis clearly shows that this assumption does not hold true for most Indian-language news resources. We show that our proposed filtration pipeline can even be applied to these large-scale scraped datasets to extract better-quality article-summary pairs.
Precogiiith at hinglisheval: Leveraging code-mixing metrics & language model embeddings to estimate code-mix quality
Kodali Prashant,Tanmay Sachan,Akshay Goindani,Anmol Goel,Naman Ahuja,Manish Shrivastava,Ponnurangam Kumaraguru
Technical Report, arXiv, 2022
@inproceedings{bib_Prec_2022, AUTHOR = {Kodali Prashant, Tanmay Sachan, Akshay Goindani, Anmol Goel, Naman Ahuja, Manish Shrivastava, Ponnurangam Kumaraguru}, TITLE = {Precogiiith at hinglisheval: Leveraging code-mixing metrics & language model embeddings to estimate code-mix quality}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
Code-mixing is a phenomenon of mixing two or more languages in a speech event and is prevalent in multilingual societies. Given the low-resource nature of code-mixing, machine generation of code-mixed text is a prevalent approach for data augmentation. However, evaluating the quality of such machine-generated code-mixed text is an open problem. In our submission to HinglishEval, a shared task collocated with INLG 2022, we attempt to model the factors that impact the quality of synthetically generated code-mixed text by predicting ratings for code-mix quality. The HinglishEval shared task consists of two subtasks: a) quality rating prediction; b) disagreement prediction. We leverage popular code-mixing metrics and embeddings of multilingual large language models (MLLMs) as features, and train task-specific MLP regression models. Our approach could not beat the baseline results. However, for Subtask-A our team ranked a close second on the F1 and Cohen’s Kappa score measures and first on the Mean Squared Error measure. For Subtask-B our approach ranked third on F1 score and first on the Mean Squared Error measure. Code of our submission can be accessed here.
Bilingual Tabular Inference: A Case Study on Indic Languages
Chaitanya Agarwal,Vivek Gupta,Anoop Kunchukuttan,Manish Shrivastava
North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL- HLT, 2022
@inproceedings{bib_Bili_2022, AUTHOR = {Chaitanya Agarwal, Vivek Gupta, Anoop Kunchukuttan, Manish Shrivastava}, TITLE = {Bilingual Tabular Inference: A Case Study on Indic Languages}, BOOKTITLE = {North American Chapter of the Association for Computational Linguistics: Human Language Technologies}. YEAR = {2022}}
Existing research on Tabular Natural Language Inference (TNLI) exclusively examines the task in a monolingual setting where the tabular premise and hypothesis are in the same language. However, due to the uneven distribution of text resources on the web across languages, it is common to have the tabular premise in a high resource language and the hypothesis in a low resource language. As a result, we present the challenging task of bilingual Tabular Natural Language Inference (bTNLI), in which the tabular premise and a hypothesis over it are in two separate languages. We construct EI-InfoTabS: an English-Indic bTNLI dataset by translating the textual hypotheses of the English TNLI dataset InfoTabS into eleven major Indian languages. We thoroughly investigate how pre-trained multilingual models learn and perform on EI-InfoTabS. Our study shows that the performance on bTNLI can be close to its monolingual counterpart, with translate-train, translate-test and unified-train being strongly competitive baselines.
Tesla at SemEval-2022 Task 4: Patronizing and Condescending Language Detection using Transformer-based Models with Data Augmentation
Sahil Manoj Bhatt,Manish Shrivastava
International Workshop on Semantic Evaluation, SemEval, 2022
@inproceedings{bib_Tesl_2022, AUTHOR = {Sahil Manoj Bhatt, Manish Shrivastava}, TITLE = {Tesla at SemEval-2022 Task 4: Patronizing and Condescending Language Detection using Transformer-based Models with Data Augmentation}, BOOKTITLE = {International Workshop on Semantic Evaluation}. YEAR = {2022}}
This paper describes our system for Task 4 of SemEval 2022: Patronizing and Condescending Language (PCL) Detection. For sub-task 1, where the objective is to classify a text as PCL or non-PCL, we use a T5 Model fine-tuned on the dataset. For sub-task 2, which is a multi-label classification problem, we use a RoBERTa model fine-tuned on the dataset. Given that the key challenge in this task is classification on an imbalanced dataset, our models rely on an augmented dataset that we generate using paraphrasing. We found that these two models yield the best results out of all the other approaches we tried.
“Kanglish alli names!” Named Entity Recognition for Kannada-English Code-Mixed Social Media Data
Sumukh S,Manish Shrivastava
Workshop on Noisy User-generated Text, W- NUT, 2022
@inproceedings{bib_“K_2022, AUTHOR = {Sumukh S, Manish Shrivastava}, TITLE = {“Kanglish alli names!” Named Entity Recognition for Kannada-English Code-Mixed Social Media Data}, BOOKTITLE = {Workshop on Noisy User-generated Text}. YEAR = {2022}}
Code-mixing (CM) is a frequently observed phenomenon on social media platforms in multilingual societies such as India. While the increase in code-mixed content on these platforms provides a good amount of data for studying various aspects of code-mixing, the lack of automated text analysis tools makes such studies difficult. To overcome this, tools such as language identifiers and part-of-speech (POS) taggers for analysing code-mixed data have been developed. One such tool is Named Entity Recognition (NER), an important Natural Language Processing (NLP) task, which is not only a subtask of Information Extraction but is also needed for downstream NLP tasks such as semantic role labeling. While entity extraction from social media data is generally difficult due to its informal nature, code-mixed data further complicates the problem with its informal, unstructured, and incomplete information. In this work, we present the first corpus of Kannada-English code-mixed social media data with corresponding named entity tags for NER. We provide strong baselines with machine learning classification models such as CRF, Bi-LSTM, and Bi-LSTM-CRF on our corpus, using word, character, and lexical features.
LTRC @MuP 2022: Multi-Perspective Scientific Document Summarization Using Pre-trained Generation Models
Urlana Ashok,NIRMAL SURANGE,Manish Shrivastava
Workshop on Scholarly Document Processing, SDP-W, 2022
@inproceedings{bib_LTRC_2022, AUTHOR = {Urlana Ashok, NIRMAL SURANGE, Manish Shrivastava}, TITLE = {LTRC @MuP 2022: Multi-Perspective Scientific Document Summarization Using Pre-trained Generation Models}, BOOKTITLE = {Workshop on Scholarly Document Processing}. YEAR = {2022}}
The MuP 2022 shared task focuses on multi-perspective scientific document summarization. Given a scientific document with multiple reference summaries, our goal was to develop a model that can produce a generic summary covering as many aspects of the document as are covered by all of its reference summaries. This paper describes our best official model, a fine-tuned BART-large, along with a discussion of the challenges of this task and some of our unofficial models, including SOTA generation models. Our submitted model outperformed the MuP 2022 shared task baselines on ROUGE-2, ROUGE-L, and average ROUGE F1-scores. The code of our submission can be accessed here.
SYSTEM AND METHOD FOR GENERATING QUERYABLE STRUCTURED DOCUMENT FROM AN UNSTRUCTURED DOCUMENT USING MACHINE LEARNING
Manish Shrivastava,Vishnu Ramesh
United States Patent, Us patent, 2022
@inproceedings{bib_SYST_2022, AUTHOR = {Manish Shrivastava, Vishnu Ramesh}, TITLE = {SYSTEM AND METHOD FOR GENERATING QUERYABLE STRUCTURED DOCUMENT FROM AN UNSTRUCTURED DOCUMENT USING MACHINE LEARNING}, BOOKTITLE = {United States Patent}. YEAR = {2022}}
A system for generating a queryable structured document from an unstructured document using a machine learning model is provided. The system (i) identifies breakpoints in the unstructured document, (ii) segments the unstructured document into one or more fragments based on the identified breakpoints, (iii) classifies the one or more fragments as title fragments or non-title fragments based on the sequence of word positions in each fragment, (iv) constructs a data tree using the title fragments and the non-title fragments as nodes of the data tree, (v) assigns one or more vectors to each node of the data tree, and (vi) generates a structured document by providing a matrix representation for each node of the data tree.
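Steps (i)-(iv) can be sketched roughly as below; the blank-line breakpoints and the short/title-cased heuristic for titles are illustrative assumptions, not the patented classifier.

```python
def build_document_tree(text):
    """Sketch: segment an unstructured document at blank-line breakpoints,
    classify fragments as titles (short and title-cased) or body text, and
    nest body fragments under the most recent title node."""
    fragments = [f.strip() for f in text.split("\n\n") if f.strip()]
    root = {"title": "<root>", "children": []}
    current = root
    for frag in fragments:
        # Toy title classifier: few words, every word capitalized.
        is_title = len(frag.split()) <= 6 and frag.istitle()
        if is_title:
            node = {"title": frag, "children": []}
            root["children"].append(node)
            current = node                      # subsequent body text nests here
        else:
            current["children"].append({"text": frag})
    return root

doc = ("Revenue Summary\n\nRevenue grew 12% year over year.\n\n"
       "Risk Factors\n\nSupply chain costs may rise.")
tree = build_document_tree(doc)
```

A real system would replace the heuristic with a learned fragment classifier and attach vector/matrix representations to each node, as the claims describe.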
Leveraging Data Recasting to Enhance Tabular Reasoning
Aashna Jena,Vivek Gupta,Manish Shrivastava,Julian Martin Eisenschlos
Empirical Methods in Natural Language Processing-Findings, EMNLP-F, 2022
@inproceedings{bib_Leve_2022, AUTHOR = {Aashna Jena, Vivek Gupta, Manish Shrivastava, Julian Martin Eisenschlos}, TITLE = {Leveraging Data Recasting to Enhance Tabular Reasoning}, BOOKTITLE = {Empirical Methods in Natural Language Processing-Findings}. YEAR = {2022}}
Creating challenging tabular inference data is essential for learning complex reasoning. Prior work has mostly relied on two data generation strategies. The first is human annotation, which yields linguistically diverse data but is difficult to scale. The second is synthetic generation, which is scalable and cost-effective but lacks inventiveness. In this research, we present a framework for semi-automatically recasting existing tabular data to take advantage of the benefits of both approaches. We utilize our framework to build tabular NLI instances from five datasets that were initially intended for tasks like table2text generation, tabular Q/A, and semantic parsing. We demonstrate that recasted data can be used both as evaluation benchmarks and as augmentation data to enhance performance on tabular NLI tasks. Furthermore, we investigate the effectiveness of models trained on recasted data in the zero-shot scenario, and analyse trends in performance across different recasted dataset types.
Diverse Multi-Answer Retrieval with Determinantal Point Processes
SAI LAKSHMI POOJITHA NANDIGAM,Nikhil Rayaprolu,Manish Shrivastava
Technical Report, arXiv, 2022
@inproceedings{bib_Dive_2022, AUTHOR = {SAI LAKSHMI POOJITHA NANDIGAM, Nikhil Rayaprolu, Manish Shrivastava}, TITLE = {Diverse Multi-Answer Retrieval with Determinantal Point Processes}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
Questions provided to open-domain question answering systems are often ambiguous. Traditional QA systems that provide a single answer are incapable of answering ambiguous questions, since the question may be interpreted in several ways and may have multiple distinct answers. In this paper, we address multi-answer retrieval, which entails retrieving passages that can capture a majority of the diverse answers to the question. We propose a re-ranking-based approach using determinantal point processes with BERT as the kernel. Our method jointly considers query-passage relevance and passage-passage correlation to retrieve passages that are both query-relevant and diverse. Results demonstrate that our re-ranking technique outperforms the state-of-the-art method on the AmbigQA dataset.
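The relevance-plus-diversity trade-off can be sketched with greedy MAP inference for a DPP over a toy quality-diversity kernel; the hand-set relevance and similarity numbers stand in for the paper's BERT-based kernel, which this is not.

```python
import numpy as np

def greedy_dpp(L, k):
    """Greedy MAP inference for a determinantal point process:
    repeatedly add the item that most increases det(L_S), which
    rewards individually relevant items (large diagonal entries)
    but penalizes similarity to already-selected items."""
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(L)):
            if i in selected:
                continue
            S = selected + [i]
            gain = np.linalg.det(L[np.ix_(S, S)])
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

# Toy setup: passages 0 and 1 are near-duplicates; passage 2 is diverse.
sim = np.array([[1.0, 0.95, 0.1],
                [0.95, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
rel = np.array([1.0, 0.9, 0.8])          # query-passage relevance
L = rel[:, None] * sim * rel[None, :]    # quality-diversity kernel
picked = greedy_dpp(L, 2)
```

Even though passage 1 is more relevant than passage 2, the determinant penalizes its overlap with passage 0, so the diverse passage wins the second slot.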
Generalised Spherical Text Embedding
Souvik Banerjee,Bamdev Mishra,Pratik Jawanpuria,Manish Shrivastava
Technical Report, arXiv, 2022
@inproceedings{bib_Gene_2022, AUTHOR = {Souvik Banerjee, Bamdev Mishra, Pratik Jawanpuria, Manish Shrivastava}, TITLE = {Generalised Spherical Text Embedding}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
This paper aims to provide an unsupervised modelling approach that allows for a more flexible representation of text embeddings. It jointly encodes the words and the paragraphs as individual matrices of arbitrary column dimension with unit Frobenius norm. The representation is also linguistically motivated with the introduction of a novel similarity metric. The proposed modelling and the novel similarity metric exploits the matrix structure of embeddings. We then go on to show that the same matrices can be reshaped into vectors of unit norm and transform our problem into an optimization problem over the spherical manifold. We exploit manifold optimization to efficiently train the matrix embeddings. We also quantitatively verify the quality of our text embeddings by showing that they demonstrate improved results in document classification, document clustering, and semantic textual similarity benchmark tests.
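The reshaping step can be illustrated directly: the Frobenius norm of a matrix equals the Euclidean norm of its flattened vector, so a unit-Frobenius-norm matrix embedding is a point on a sphere. A small numpy sketch (the dimensions are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# An embedding as an 8 x 4 matrix, normalized to unit Frobenius norm.
M = rng.normal(size=(8, 4))
M /= np.linalg.norm(M)      # np.linalg.norm on a matrix = Frobenius norm

# Flattening preserves the norm, so the matrix embedding becomes a
# unit vector, and training becomes optimization on the sphere.
v = M.reshape(-1)
```

This equivalence is what lets the matrix-embedding problem be handed to off-the-shelf spherical manifold optimizers.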
LTRC@ Causal News Corpus 2022: Extracting and identifying causal elements using adapters
Hiranmai Sri Adibhatla,Manish Shrivastava
Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text, CASE-W, 2022
@inproceedings{bib_LTRC_2022, AUTHOR = {Hiranmai Sri Adibhatla, Manish Shrivastava}, TITLE = {LTRC@ Causal News Corpus 2022: Extracting and identifying causal elements using adapters}, BOOKTITLE = {Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text}. YEAR = {2022}}
Causality detection and identification are centered on identifying semantic and cognitive connections in a sentence. In this paper, we describe the effort of team LTRC for the Causal News Corpus - Event Causality Shared Task 2022 at the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2022) (Tan et al., 2022a). The shared task consisted of two subtasks: 1) identifying if a sentence contains a causality relation, and 2) identifying spans of text that correspond to cause, effect, and signals. We fine-tuned transformer-based models with adapters for both subtasks. Our best-performing models obtained a binary F1 score of 0.853 on held-out data for subtask 1 and a macro F1 score of 0.032 on held-out data for subtask 2. Our approach ranked third in subtask 1 and fourth in subtask 2. The paper describes our experiments, solutions, and analysis in detail.
Docinfer: Document-level natural language inference using optimal evidence selection
Puneet Mathur,Riyaz Bhat,Gautam Kunapuli,Manish Shrivastava,Dinesh Manocha,Maneesh Singh
Conference on Empirical Methods in Natural Language Processing, EMNLP, 2022
@inproceedings{bib_Doci_2022, AUTHOR = {Puneet Mathur, Riyaz Bhat, Gautam Kunapuli, Manish Shrivastava, Dinesh Manocha, Maneesh Singh}, TITLE = {Docinfer: Document-level natural language inference using optimal evidence selection}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}. YEAR = {2022}}
We present DocInfer, a novel end-to-end document-level Natural Language Inference model that builds a hierarchical document graph enriched through inter-sentence relations (topical, entity-based, concept-based), performs paragraph pruning using the novel SubGraph Pooling layer, and then performs optimal evidence selection based on the REINFORCE algorithm to identify the most important context sentences for a given hypothesis. Our evidence selection mechanism allows it to transcend the input length limitation of modern BERT-like Transformer models while presenting the entire evidence together for inferential reasoning. We show this is an important property needed to reason over large documents where the evidence may be fragmented and located arbitrarily far from each other. Extensive experiments on popular corpora (DocNLI, ContractNLI, and ConTRoL) and our newly proposed dataset, CaseHoldNLI, on the task of legal judicial reasoning, demonstrate significant performance gains of 8-12% over SOTA methods. Our ablation studies validate the impact of our model. A performance improvement of 3-6% on the annotation-scarce downstream tasks of fact verification, multiple-choice QA, and contract clause retrieval demonstrates the usefulness of DocInfer beyond primary NLI tasks.
Is My Model Using the Right Evidence? Systematic Probes for Examining Evidence-Based Tabular Reasoning
Vivek Gupta,Riyaz A. Bhat,Atreyee Ghosal,Manish Shrivastava,Maneesh Singh,Vivek Srikumar
Transactions of the Association for Computational Linguistics, TACL, 2022
@inproceedings{bib_Is_M_2022, AUTHOR = {Vivek Gupta, Riyaz A. Bhat, Atreyee Ghosal, Manish Shrivastava, Maneesh Singh, Vivek Srikumar}, TITLE = {Is My Model Using the Right Evidence? Systematic Probes for Examining Evidence-Based Tabular Reasoning}, BOOKTITLE = {Transactions of the Association for Computational Linguistics}. YEAR = {2022}}
Neural models command state-of-the-art performance across NLP tasks, including ones involving "reasoning". Models claiming to reason about the evidence presented to them should attend to the correct parts of the input while avoiding spurious patterns therein, be self-consistent in their predictions across inputs, and be immune to biases derived from their pre-training in a nuanced, context-sensitive fashion. Do the prevalent *BERT-family of models do so? In this paper, we study this question using the problem of reasoning on tabular data. Tabular inputs are especially well-suited for the study: they admit systematic probes targeting the properties listed above. Our experiments demonstrate that a RoBERTa-based model, representative of the current state-of-the-art, fails at reasoning on the following counts: it (a) ignores relevant parts of the evidence, (b) is over-sensitive to annotation artifacts, and (c) relies on the knowledge encoded in the pre-trained language model rather than the evidence presented in its tabular inputs. Finally, through inoculation experiments, we show that fine-tuning the model on perturbed data does not help it overcome the above challenges.
MARCUS: An Event-Centric NLP Pipeline that generates Character Arcs from Narratives
Sriharsh Bhyravajjula,Ujwal Narayan,Manish Shrivastava
CEUR Workshop Proceedings, CEUR, 2022
@inproceedings{bib_MARC_2022, AUTHOR = {Sriharsh Bhyravajjula, Ujwal Narayan, Manish Shrivastava}, TITLE = {MARCUS: An Event-Centric NLP Pipeline that generates Character Arcs from Narratives}, BOOKTITLE = {CEUR Workshop Proceedings}. YEAR = {2022}}
Character arcs are important theoretical devices employed in literary studies to understand character journeys, identify tropes across literary genres, and establish similarities between narratives. This work addresses the novel task of computationally generating event-centric, relation-based character arcs from narratives. Providing a quantitative representation for arcs brings tangibility to a theoretical concept and paves the way for subsequent applications. We present MARCUS (Modelling Arcs for Understanding Stories), an NLP pipeline that extracts events, participant characters, implied emotion, and sentiment to model inter-character relations. MARCUS tracks and aggregates these relations across the narrative to generate character arcs as graphical plots. We generate character arcs from two extended fantasy series, Harry Potter and Lord of the Rings. We evaluate our approach before outlining existing challenges, suggesting applications of our pipeline, and discussing future work.
HashSet--A Dataset For Hashtag Segmentation
Kodali Prashant,Akshala Bhatnagar,Naman Ahuja,Manish Srivastava,Ponnurangam Kumaraguru
Technical Report, arXiv, 2022
@inproceedings{bib_Hash_2022, AUTHOR = {Kodali Prashant, Akshala Bhatnagar, Naman Ahuja, Manish Srivastava, Ponnurangam Kumaraguru}, TITLE = {HashSet--A Dataset For Hashtag Segmentation}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
Hashtag segmentation is the task of breaking a hashtag into its constituent tokens. Hashtags often encode the essence of user-generated posts, along with information like topic and sentiment, which are useful in downstream tasks. Hashtags prioritize brevity and are written in unique ways: transliterating and mixing languages, spelling variations, and creative named entities. The benchmark datasets used for the hashtag segmentation task, STAN and BOUN, are small in size and extracted from a single set of tweets. However, datasets should reflect the variations in writing styles of hashtags and also account for domain and language specificity, failing which the results will misrepresent model performance. We argue that model performance should be assessed on a wider variety of hashtags, and that datasets should be carefully curated. To this end, we propose HashSet, comprising (a) a 1.9k manually annotated dataset and (b) a 3.3M loosely supervised dataset. HashSet is sampled from a different set of tweets than existing datasets and provides an alternate distribution of hashtags to build and validate hashtag segmentation models. We show that the performance of SOTA models for hashtag segmentation drops substantially on the proposed dataset, indicating that it provides an alternate set of hashtags to train and assess models. The datasets and results are released publicly.
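For intuition, the task can be sketched as a classic word-break dynamic program over a toy vocabulary; this is illustrative only, since the models benchmarked in the paper are learned segmenters, and the hashtag and vocabulary here are made up.

```python
def segment_hashtag(tag, vocab):
    """Segment a hashtag by dynamic programming over a vocabulary,
    preferring splits with fewer (hence longer) words; returns None
    when no segmentation exists."""
    tag = tag.lstrip("#").lower()
    n = len(tag)
    best = [None] * (n + 1)   # best[i] = best segmentation of tag[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and tag[j:i] in vocab:
                cand = best[j] + [tag[j:i]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[n]

vocab = {"monsoon", "sale", "is", "live", "mon", "soon"}
parts = segment_hashtag("#MonsoonSaleIsLive", vocab)
```

Note how the fewest-words preference picks "monsoon" over "mon" + "soon"; the hard cases HashSet targets (transliteration, mixed scripts, creative spellings) are exactly where such vocabulary-based splitting breaks down.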
Fusion of Intrinsic & Extrinsic Sentential Traits for Text Coherence Assessment
Manish Kumar Singh,Manish Shrivastava
Joint International Conference on Data Science & Management of Data, CODS-COMAD, 2021
@inproceedings{bib_Fusi_2021, AUTHOR = {Manish Kumar Singh, Manish Shrivastava}, TITLE = {Fusion of Intrinsic & Extrinsic Sentential Traits for Text Coherence Assessment}, BOOKTITLE = {Joint International Conference on Data Science & Management of Data}. YEAR = {2021}}
In this paper, we investigate the problem of text coherence modeling from a new perspective: learning better distributed sentence representations by incorporating document-dependent and document-independent features of a sentence. To this end, we propose a novel data-driven, end-to-end neural coherence model that captures text coherence by exploiting the semantic and distributional aspects of the sentences in a document. The network is able to capture various document-independent and document-dependent features of a sentence, which is imperative in assessing text coherence. Experiments on the standard sentence ordering task indicate that our proposed model shows a significant performance gain of 1.7% in accuracy compared with the state-of-the-art baselines.
A3-108 Machine Translation System for Similar Language Translation Shared Task 2021
Saumitra Yadav,Manish Shrivastava
Conference on Machine Translation, WMT, 2021
@inproceedings{bib_A3-1_2021, AUTHOR = {Saumitra Yadav, Manish Shrivastava}, TITLE = {A3-108 Machine Translation System for Similar Language Translation Shared Task 2021}, BOOKTITLE = {Conference on Machine Translation}. YEAR = {2021}}
In this paper, we describe our submissions for the Similar Language Translation Shared Task 2021. We built 3 systems in each direction for the Tamil ⇐⇒ Telugu language pair. This paper outlines experiments with various tokenization schemes to train statistical models. We also report the configuration of the submitted systems and results produced by them.
The Effect of Pretraining on Extractive Summarization for Scientific Documents
Yash Gupta,Preethi Jyoth,Pawan Sasanka Ammanamanchi,Shikha Bordia,Arjun Manoharan,Deepak Mittal,Ramakanth Pasunuru,Manish Shrivastava,Maneesh Singh,Mohit Bansal
Workshop on Scholarly Document Processing, SDP-W, 2021
@inproceedings{bib_The__2021, AUTHOR = {Yash Gupta, Preethi Jyoth, Pawan Sasanka Ammanamanchi, Shikha Bordia, Arjun Manoharan, Deepak Mittal, Ramakanth Pasunuru, Manish Shrivastava, Maneesh Singh, Mohit Bansal}, TITLE = {The Effect of Pretraining on Extractive Summarization for Scientific Documents}, BOOKTITLE = {Workshop on Scholarly Document Processing}. YEAR = {2021}}
Large pretrained models have seen enormous success in extractive summarization tasks. In this work, we investigate the influence of pretraining on a BERT-based extractive summarization system for scientific documents. We derive significant performance improvements using an intermediate pretraining step that leverages existing summarization datasets and report state-of-the-art results on a recently released scientific summarization dataset, SciTLDR. We systematically analyze the intermediate pretraining step by varying the size and domain of the pretraining corpus, changing the length of the input sequence in the target task and varying target tasks. We also investigate how intermediate pretraining interacts with contextualized word embeddings trained on different domains.
“Subverting the Jewtocracy”: Online Antisemitism Detection Using Multimodal Deep Learning
Mohit Chandra,Dheeraj Pailla,Himanshu Bhatia,Aadilmehdi J Sanchawala,Manish Gupta,Manish Shrivastava,Ponnurangam Kumaraguru
WEB SCIENCE, WEBSCI, 2021
@inproceedings{bib_“S_2021, AUTHOR = {Mohit Chandra, Dheeraj Pailla, Himanshu Bhatia, Aadilmehdi J Sanchawala, Manish Gupta, Manish Shrivastava, Ponnurangam Kumaraguru}, TITLE = {“Subverting the Jewtocracy”: Online Antisemitism Detection Using Multimodal Deep Learning}, BOOKTITLE = {WEB SCIENCE}. YEAR = {2021}}
The exponential rise of online social media has enabled the creation, distribution, and consumption of information at an unprecedented rate. However, it has also led to the burgeoning of various forms of online abuse. Increasing cases of online antisemitism have become one of the major concerns because of its socio-political consequences. Unlike other major forms of online abuse like racism, sexism, etc., online antisemitism has not been studied much from a machine learning perspective. To the best of our knowledge, we present the first work in the direction of automated multimodal detection of online antisemitism. The task poses multiple challenges that include extracting signals across multiple modalities, contextual references, and handling multiple aspects of antisemitism. Unfortunately, there does not exist any publicly available benchmark corpus for this critical task. Hence, we collect and label two datasets with 3,102 and 3,509 social media posts from Twitter and Gab respectively. Further, we present a multimodal deep learning system that detects the presence of antisemitic content and its specific antisemitism category using text and images from posts. We perform an extensive set of experiments on the two datasets to evaluate the efficacy of the proposed system. Finally, we also present a qualitative analysis of our study.
Enhancing Aspect Extraction for Hindi
ARGHYA BHATTACHARYA,Alok Debnath,Manish Srivastava
Workshop on e-Commerce and NLP, ECNLP, 2021
@inproceedings{bib_Enha_2021, AUTHOR = {ARGHYA BHATTACHARYA, Alok Debnath, Manish Srivastava}, TITLE = {Enhancing Aspect Extraction for Hindi}, BOOKTITLE = {Workshop on e-Commerce and NLP}. YEAR = {2021}}
Aspect extraction is not a well-explored topic in Hindi, with only one corpus having been developed for the task. In this paper, we discuss the merits of the existing corpus in terms of quality, size, sparsity, and performance in aspect extraction tasks using established models. To provide a better baseline corpus for aspect extraction, we translate the SemEval 2014 aspect-based sentiment analysis dataset and annotate the aspects in that data. We provide rigorous guidelines and a replicable methodology for this task. We quantitatively evaluate the translations and annotations using inter-annotator agreement scores. We also evaluate our dataset using state-of-the-art neural aspect extraction models in both monolingual and multilingual settings and show that the models perform far better on our corpus than on the existing Hindi dataset. With this, we establish our corpus as the gold-standard aspect extraction dataset in Hindi.
A3-108 Machine Translation System for LoResMT Shared Task@ MT Summit 2021 Conference
Saumitra Yadav,Manish Srivastava
Workshop on Technologies for MT of Low Resource Languages, LoResMT, 2021
@inproceedings{bib_A3-1_2021, AUTHOR = {Saumitra Yadav, Manish Srivastava}, TITLE = {A3-108 Machine Translation System for LoResMT Shared Task@ MT Summit 2021 Conference}, BOOKTITLE = {Workshop on Technologies for MT of Low Resource Languages}. YEAR = {2021}}
In this paper, we describe our submissions for the LoResMT Shared Task @ MT Summit 2021 Conference. We built statistical translation systems in each direction for the English ⇐⇒ Marathi language pair. This paper outlines initial baseline experiments with various tokenization schemes to train models. Using the optimal tokenization scheme, we create synthetic data and train further statistical models on the augmented dataset. We also reorder English to match Marathi syntax and train another set of baseline and data-augmented models using various tokenization schemes. We report the configuration of the submitted systems and the results produced by them.
Topic Shift Detection for Mixed Initiative Response
Konigari Rachna,Saurabh Chand Ramola,Alluri Vijay Vardhan,Manish Srivastava
Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL, 2021
@inproceedings{bib_Topi_2021, AUTHOR = {Konigari Rachna, Saurabh Chand Ramola, Alluri Vijay Vardhan, Manish Srivastava}, TITLE = {Topic Shift Detection for Mixed Initiative Response}, BOOKTITLE = {Annual Meeting of the Special Interest Group on Discourse and Dialogue}. YEAR = {2021}}
Topic diversion occurs frequently with engaging open-domain dialogue systems like virtual assistants. The balance between staying on topic and rectifying topic drift is important for a good collaborative system. In this paper, we present a model that uses a fine-tuned XLNet-base to classify utterances as pertaining to the major topic of conversation or not, with a precision of 84%. We propose a preliminary study classifying utterances into major, minor, and off-topics, which further extends into a system initiative for diversion rectification. A case study was conducted where a system initiative is emulated as a response to the user going off-topic, mimicking a common occurrence of mixed initiative present in natural human-human conversation. This task of classifying utterances into those which belong to the major theme or not would also help in identifying relevant sentences for tasks like dialogue summarization and information extraction from conversations.
A Dynamic Head Importance Computation Mechanism for Neural Machine Translation
Akshay Goindani,Manish Srivastava
Recent advance in Natural language Processing, RANLP, 2021
@inproceedings{bib_A_Dy_2021, AUTHOR = {Akshay Goindani, Manish Srivastava}, TITLE = {A Dynamic Head Importance Computation Mechanism for Neural Machine Translation}, BOOKTITLE = {Recent advance in Natural language Processing}. YEAR = {2021}}
Multiple parallel attention mechanisms that use multiple attention heads facilitate greater performance of the Transformer model for various applications, e.g., Neural Machine Translation (NMT) and text classification. In the multi-head attention mechanism, different heads attend to different parts of the input. However, multiple heads might attend to the same part of the input, making those heads redundant and leaving model resources under-utilized. One approach to avoid this is to prune the least important heads based on an importance score. In this work, we design a Dynamic Head Importance Computation Mechanism (DHICM) to dynamically calculate the importance of each head with respect to the input. Our insight is to design an additional attention layer together with multi-head attention, and to use the outputs of the multi-head attention along with the input to compute the importance of each head. Additionally, we add an extra loss function that prevents the model from assigning the same score to all heads, helping identify the more important heads and improve performance. We analyzed the performance of DHICM for NMT across different languages. Experiments on different datasets show that DHICM outperforms the traditional Transformer-based approach by a large margin, especially when less training data is available.
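The core idea of scoring heads dynamically against the input can be sketched as a plain softmax attention over head outputs; the parameterisation below is an assumption for illustration, not the paper's exact mechanism.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def head_importance(head_outputs, x):
    """Sketch: compare each head's output against the input
    representation and normalise the scores across heads, so
    redundant heads can be down-weighted in the mixed output."""
    scores = softmax(head_outputs @ x)   # one importance score per head
    combined = scores @ head_outputs     # importance-weighted combination
    return scores, combined

rng = np.random.default_rng(1)
H = rng.normal(size=(8, 16))   # outputs of 8 attention heads, 16-dim each
x = rng.normal(size=16)        # input representation
scores, combined = head_importance(H, x)
```

Because the scores are softmax-normalised, an auxiliary loss (as in the paper) is needed to keep the distribution from collapsing to uniform, i.e. assigning the same score to every head.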
Battling Hateful Content in Indic Languages HASOC21
Kadam Aditya Santosh,Anmol Goel,Jivitesh Jain,Jushaan Singh Kalra,Mallika Subramanian,Manvith Muthukuru Reddy,Kodali Prashant,T H Arjun,Manish Srivastava
Hate Speech and Offensive Content Identification in Indo-European Languages, HASOC, 2021
@inproceedings{bib_Batt_2021, AUTHOR = {Kadam Aditya Santosh, Anmol Goel, Jivitesh Jain, Jushaan Singh Kalra, Mallika Subramanian, Manvith Muthukuru Reddy, Kodali Prashant, T H Arjun, Manish Srivastava}, TITLE = {Battling Hateful Content in Indic Languages HASOC21}, BOOKTITLE = {Hate Speech and Offensive Content Identification in Indo-European Languages}. YEAR = {2021}}
The extensive rise in consumption of online social media (OSMs) by a large number of people poses a critical problem of curbing the spread of hateful content on these platforms. With the growing usage of OSMs in multiple languages, the task of detecting and characterizing hate becomes more complex. The subtle variations of code-mixed texts along with switching scripts only add to the complexity. This paper presents a solution for the HASOC 2021 Multilingual Twitter Hate-Speech Detection challenge by team PreCog IIIT Hyderabad. We adopt a multilingual transformer-based approach and describe our architecture for all 6 sub-tasks of the challenge. Out of the 6 teams that participated in all the sub-tasks, our submissions rank 3rd overall.
Volta at SemEval-2021 Task 9: Statement Verification and Evidence Finding with Tables using TAPAS and Transfer Learning
Devansh Gautam,Kshitij Gupta,Manish Srivastava
Technical Report, arXiv, 2021
@inproceedings{bib_Volt_2021, AUTHOR = {Devansh Gautam, Kshitij Gupta, Manish Srivastava}, TITLE = {Volta at SemEval-2021 Task 9: Statement Verification and Evidence Finding with Tables using TAPAS and Transfer Learning}, BOOKTITLE = {Technical Report}, YEAR = {2021}}
Tables are widely used in various kinds of documents to present information concisely. Understanding tables is a challenging problem that requires an understanding of language and table structure, along with numerical and logical reasoning. In this paper, we present our systems to solve Task 9 of SemEval-2021: Statement Verification and Evidence Finding with Tables (SEM-TAB-FACTS). The task consists of two subtasks: (A) Given a table and a statement, predicting whether the table supports the statement and (B) Predicting which cells in the table provide evidence for/against the statement. We fine-tune TAPAS (a model which extends BERT's architecture to capture tabular structure) for both the subtasks as it has shown state-of-the-art performance in various table understanding tasks. In subtask A, we evaluate how transfer learning and standardizing tables to have a single header row improves TAPAS' performance. In subtask B, we evaluate how different fine-tuning strategies can improve TAPAS' performance. Our systems achieve an F1 score of 67.34 in subtask A three-way classification, 72.89 in subtask A two-way classification, and 62.95 in subtask B.
Translate and Classify: Improving Sequence Level Classification for English-Hindi Code-Mixed Data
Devansh Gautam,Kshitij Gupta,Manish Srivastava
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2021
@inproceedings{bib_Tran_2021, AUTHOR = {Devansh Gautam, Kshitij Gupta, Manish Srivastava}, TITLE = {Translate and Classify: Improving Sequence Level Classification for English-Hindi Code-Mixed Data}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}, YEAR = {2021}}
Code-mixing is a common phenomenon in multilingual societies around the world and is especially common in social media texts. Traditional NLP systems, usually trained on monolingual corpora, do not perform well on code-mixed texts. Training specialized models for code-switched texts is difficult due to the lack of large-scale datasets. Translating code-mixed data into standard languages like English could improve performance on various code-mixed tasks, since we can use transfer learning from state-of-the-art English models to process the translated data. This paper focuses on two sequence-level classification tasks for English-Hindi code-mixed texts, which are part of the GLUECoS benchmark: Natural Language Inference and Sentiment Analysis. We propose using various pre-trained models that have been fine-tuned for similar English-only tasks and have shown state-of-the-art performance. We further fine-tune these models on the translated code-mixed datasets and achieve state-of-the-art performance in both tasks. To translate English-Hindi code-mixed data to English, we use mBART, a pre-trained multilingual sequence-to-sequence model that has shown competitive performance on various low-resource machine translation pairs and has also shown performance gains for languages that were not in its pre-training corpus.
CoMeT: Towards Code-Mixed Translation Using Parallel Monolingual Sentences
Devansh Gautam,Kodali Prashant,Kshitij Gupta,Anmol Goel,Manish Srivastava,Ponnurangam Kumaraguru
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2021
@inproceedings{bib_CoMe_2021, AUTHOR = {Devansh Gautam, Kodali Prashant, Kshitij Gupta, Anmol Goel, Manish Srivastava, Ponnurangam Kumaraguru}, TITLE = {CoMeT: Towards Code-Mixed Translation Using Parallel Monolingual Sentences}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}, YEAR = {2021}}
Code-mixed languages are very popular in multilingual societies around the world, yet the resources needed to build robust systems for such languages lag behind. A major contributing factor is the informal nature of these languages, which makes it difficult to collect code-mixed data. In this paper, we propose our system for Task 1 of CACLS 2021: a machine translation system from English to Hinglish in a supervised setting. Translating in the given direction can help expand the set of resources for several tasks by translating valuable datasets from high-resource languages. We propose to use mBART, a pre-trained multilingual sequence-to-sequence model, and fully utilize its pre-training by transliterating the roman Hindi words in the code-mixed sentences to Devanagari script. We evaluate how expanding the input by concatenating Hindi translations of the English sentences improves mBART's performance. Our system achieves a BLEU score of 12.22 on the test set. Further, we perform a detailed error analysis of our proposed systems and explore the limitations of the provided dataset and metrics.
Implementation of Moisture-Evaporation Decision Tree for Stock Rate Prediction
Dr. Bhupesh Gour,Jay Prakash Maurya,Tripti Saxena,Manish Srivastava
International Conference on Innovative Computing & Communication (ICICC) 2021, ICICC, 2021
@inproceedings{bib_Impl_2021, AUTHOR = {Dr. Bhupesh Gour, Jay Prakash Maurya, Tripti Saxena, Manish Srivastava}, TITLE = {Implementation of Moisture-Evaporation Decision Tree for Stock Rate Prediction}, BOOKTITLE = {International Conference on Innovative Computing \& Communication (ICICC) 2021}, YEAR = {2021}}
Stock rate prediction has always attracted investors, not only out of interest but also because it poses a big challenge and a mystery. The problem is complex and highly dynamic in nature: persistent volatility and the global stock market make stock prediction a challenging and interesting job. Predicting stock prices is a risky task prone to errors, which need to be minimized so that financial losses from stock market investment can be minimized. In this paper, we apply a supervised machine learning algorithm known as the moisture and evaporation decision tree (MEDT). To predict stock prices, we use the last five years of historical data with the attributes open, high, low, and close rates and the volume of the particular stock, running the MEDT algorithm on these five years of historical prices. Since market news always affects stock price movements, the current research also considers positive or negative news about a particular stock. Such news is called moisture and is assigned a weight between 0 and 1 based on the type and importance of the news for that particular stock: highly impactful news may have a moisture value of around 0.8, while less impactful news may have a value of 0.1 or 0.2. If the news impacts the stock rate in the positive direction, its magnitude is a positive moisture value; otherwise, it is a negative moisture value. The technique is successful in 88% of cases.
Subtl. ai at the FinSBD-2 task: Document Structure Identification by Paying Attention
Aman,Abhishek Arora,Sarath Chandra Pakala,Vishnu Ramesh,Manish Srivastava
Financial Technology and Natural Language Processing, FinNLP, 2021
@inproceedings{bib_Subt_2021, AUTHOR = {Aman, Abhishek Arora, Sarath Chandra Pakala, Vishnu Ramesh, Manish Srivastava}, TITLE = {Subtl. ai at the FinSBD-2 task: Document Structure Identification by Paying Attention}, BOOKTITLE = {Financial Technology and Natural Language Processing}, YEAR = {2021}}
This paper presents a methodology submitted to the FinSBD-2 shared task to extract well-formed sentences, lists, and items from noisy, unstructured financial PDF documents in English. The proposed architecture for document structure identification is a combination of deep learning and heuristic-based approaches. We use two unidirectional Long Short-Term Memory (LSTM) encoders to get the sentence split tokens from the set of all possible split points. The outputs are then passed to an attention-based LSTM network to select only the well-formed sentences from all possible sentences. These outputs are merged to ultimately produce all possible well-formed sentences. Apart from sentences, lists and items are identified using a combination of heuristics which identify patterns in the data. The final F1 score on this task, 0.217, is obtained by comparing the start and end indices of sentences, lists, and items. We also present another parameter, used to evaluate class coverage by checking the overlap between the predicted and ground-truth sentences, and obtain an average class coverage score of 40%. This metric is more useful for industry researchers who require coverage of the content rather than character-level precision. The proposed approach will empower both academic and industry researchers in their effort to handle noisy documents for various NLP tasks by providing a simple, fast, and robust approach to identifying structure in their documents.
Fusion of Intrinsic & Extrinsic Sentential Traits for Text Coherence Assessment
MANISH SINGH,Manish Srivastava
India Joint International Conference on Data Science & Management of Data, COMAD/CODS, 2021
@inproceedings{bib_Fusi_2021, AUTHOR = {MANISH SINGH, Manish Srivastava}, TITLE = {Fusion of Intrinsic \& Extrinsic Sentential Traits for Text Coherence Assessment}, BOOKTITLE = {India Joint International Conference on Data Science \& Management of Data}, YEAR = {2021}}
In this paper, we investigate the problem of text coherence modeling from a new perspective: learning better distributed sentence representations by incorporating document-dependent and document-independent features of a sentence. To this end, we propose a novel data-driven, end-to-end neural coherence model that captures text coherence by exploiting the semantic and distributional aspects of the sentences in a document. The network is able to capture various document-independent and document-dependent features of a sentence, which is imperative in assessing text coherence. Experiments on the standard Sentence Ordering task indicate that our proposed model shows a significant performance gain of 1.7% in accuracy compared with state-of-the-art baselines.
SIS@IIITH at SemEval-2020 Task 8: An Overview of Simple Text Classification Methods for Meme Analysis
Sravani Boinepelli,Manish Shrivastava,Vasudeva Varma Kalidindi
International Workshop on Semantic Evaluation , SemEval, 2020
@inproceedings{bib_SIS@_2020, AUTHOR = {Sravani Boinepelli, Manish Shrivastava, Vasudeva Varma Kalidindi}, TITLE = {SIS@IIITH at SemEval-2020 Task 8: An Overview of Simple Text Classification Methods for Meme Analysis}, BOOKTITLE = {International Workshop on Semantic Evaluation}, YEAR = {2020}}
Memes are steadily taking over the feeds of the public on social media. There is always the threat of malicious users on the internet posting offensive content, even through memes. Hence, the automatic detection of offensive images/memes is imperative along with detection of offensive text. However, this is a much more complex task as it involves both visual cues as well as language understanding and cultural/context knowledge. This paper describes our approach to the task of SemEval-2020 Task 8: Memotion Analysis. We chose to participate only in Task A which dealt with Sentiment Classification, which we formulated as a text classification problem. Through our experiments, we explored multiple training models to evaluate the performance of simple text classification algorithms on the raw text obtained after running OCR on meme images. Our submitted model achieved an accuracy of 72.69% and exceeded the existing baseline’s Macro F1 score by 8% on the official test dataset. Apart from describing our official submission, we shall elucidate how different classification models respond to this task.
Cross-Lingual Transfer for Hindi Discourse Relation Identification
ANIRUDH DAHIYA,Manish Srivastava,Dipti Mishra Sharma
Speech and Dialogue Conference, TSD, 2020
@inproceedings{bib_Cros_2020, AUTHOR = {ANIRUDH DAHIYA, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Cross-Lingual Transfer for Hindi Discourse Relation Identification}, BOOKTITLE = {Speech and Dialogue Conference}, YEAR = {2020}}
Discourse relations between two textual spans in a document attempt to capture the coherent structure which emerges in language use. Automatic classification of these relations remains a challenging task, especially in the case of implicit discourse relations, where there is no explicit textual cue marking the discourse relation. In low-resource languages, this motivates the exploration of transfer learning approaches, particularly cross-lingual techniques, for discourse relation classification. In this work, we explore various cross-lingual transfer techniques on the Hindi Discourse Relation Bank (HDRB), a Penn Discourse Treebank styled dataset for discourse analysis in Hindi, and observe performance gains in both zero-shot and fine-tuning settings on the Hindi discourse relation classification task. To the best of our knowledge, this is the first effort to explore transfer learning for Hindi discourse relation classification.
AbuseAnalyzer: Abuse Detection, Severity and Target Prediction for Gab Posts
Mohit Chandra,Ashwin Pathak,Eesha Dutta,Paryul Jain,Manish,Manish Srivastava,Ponnurangam Kumaraguru
Technical Report, arXiv, 2020
@inproceedings{bib_Abus_2020, AUTHOR = {Mohit Chandra, Ashwin Pathak, Eesha Dutta, Paryul Jain, Manish, Manish Srivastava, Ponnurangam Kumaraguru}, TITLE = {AbuseAnalyzer: Abuse Detection, Severity and Target Prediction for Gab Posts}, BOOKTITLE = {Technical Report}, YEAR = {2020}}
While the extensive popularity of online social media platforms has made information dissemination faster, it has also resulted in widespread online abuse of different types, such as hate speech, offensive language, and sexist and racist opinions. Detection and curtailment of such abusive content is critical for avoiding its psychological impact on victim communities and thereby preventing hate crimes. Previous works have focused on classifying user posts into various forms of abusive behavior, but there has hardly been any focus on estimating the severity of abuse and its target. In this paper, we present a first-of-its-kind dataset with 7,601 posts from Gab which looks at online abuse from the perspectives of the presence of abuse, its severity, and the target of abusive behavior. We also propose a system to address these tasks, obtaining an accuracy of ~80% for abuse presence, ~82% for abuse target detection, and ~64% for abuse severity detection.
MEE: An Automatic Metric for Evaluation Using Embeddings for Machine Translation
Ananya Mukherjee,Ala Hema,Manish Srivastava,Dipti Mishra Sharma
International Conference on Data Science and Advanced Analytics, DSAA, 2020
@inproceedings{bib_MEE:_2020, AUTHOR = {Ananya Mukherjee, Ala Hema, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {MEE: An Automatic Metric for Evaluation Using Embeddings for Machine Translation}, BOOKTITLE = {International Conference on Data Science and Advanced Analytics}, YEAR = {2020}}
We propose MEE, an approach for automatic Machine Translation (MT) evaluation which leverages the similarity between embeddings of words in candidate and reference sentences to assess translation quality. Unigrams are matched based on their surface forms, root forms, and meanings, which helps capture lexical, morphological, and semantic equivalence. We perform experiments for MT from English to four Indian languages (Telugu, Marathi, Bengali and Hindi) on a robust dataset comprising simple and complex sentences with good and bad translations. Further, we observe that the proposed metric correlates better with human judgements than the existing widely used metrics.
Modeling ASR Ambiguity for Neural Dialogue State Tracking
VAISHALI PAL,Fabien Guillot,Manish Srivastava,Jean-Michel Renders,Laurent Besacier
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2020
@inproceedings{bib_Mode_2020, AUTHOR = {VAISHALI PAL, Fabien Guillot, Manish Srivastava, Jean-Michel Renders, Laurent Besacier}, TITLE = {Modeling ASR Ambiguity for Neural Dialogue State Tracking}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}, YEAR = {2020}}
Spoken dialogue systems typically use one or several (top-N) ASR sequences for inferring the semantic meaning and tracking the state of the dialogue. However, ASR graphs, such as confusion networks (confnets), provide a compact representation of a richer hypothesis space than a top-N ASR list. In this paper, we study the benefits of using confusion networks with a neural dialogue state tracker (DST). We encode the 2-dimensional confnet into a 1-dimensional sequence of embeddings using a confusion network encoder which can be used with any DST system. Our confnet encoder is plugged into the 'Global-Locally Self-Attentive Dialogue State Tracker' (GLAD) model for DST and obtains significant improvements in both accuracy and inference time compared to using top-N ASR hypotheses.
Finding the Right One and Resolving it
PAYAL KULLAR,ARGHYA BHATTACHARYA,Manish Srivastava
The SIGNLL Conference on Computational Natural Language Learning, CoNLL, 2020
@inproceedings{bib_Find_2020, AUTHOR = {PAYAL KULLAR, ARGHYA BHATTACHARYA, Manish Srivastava}, TITLE = {Finding the Right One and Resolving it}, BOOKTITLE = {The SIGNLL Conference on Computational Natural Language Learning}, YEAR = {2020}}
One-anaphora has figured prominently in the theoretical linguistics literature, but computational linguistics research on the phenomenon is sparse. Moreover, the long-standing linguistic controversy between the determinative and the nominal anaphoric element one has propagated into the limited body of computational work on one-anaphora resolution, making this task harder than it needs to be. In the present paper, we resolve this by drawing from an adequate linguistic analysis of the word one in different syntactic environments, once again highlighting the significance of linguistic theory in Natural Language Processing (NLP) tasks. We prepare an annotated corpus marking actual instances of one-anaphora with their textual antecedents, and use the annotations to experiment with state-of-the-art neural models for one-anaphora resolution. Apart from presenting a strong neural baseline for this task, we contribute a gold-standard corpus, which is, to the best of our knowledge, the biggest resource on one-anaphora to date.
A3-108 Machine Translation System for Similar Language Translation Shared Task 2020
Saumitra Yadav,Manish Srivastava
Conference of the European Association for Machine Translation, EAMT, 2020
@inproceedings{bib_A3-1_2020, AUTHOR = {Saumitra Yadav, Manish Srivastava}, TITLE = {A3-108 Machine Translation System for Similar Language Translation Shared Task 2020}, BOOKTITLE = {Conference of the European Association for Machine Translation}, YEAR = {2020}}
In this paper, we describe our submissions for the Similar Language Translation Shared Task 2020. We built 12 systems in each direction for the Hindi ⇔ Marathi language pair. This paper outlines initial baseline experiments with various tokenization schemes to train statistical models. Using the optimal tokenization scheme among these, we created synthetic source-side text with back-translation and pruned the synthetic text using language model scores. This synthetic data was then used along with the training data in various settings to build translation models. We also report the configuration of the submitted systems and the results they produced.
Creation of Corpus and analysis in Code-Mixed Kannada-English Twitter data for Emotion Prediction
APPIDI ABHINAV REDDY,SRIRANGAM VAMSHI KRISHNA,DARSI SUHAS,Manish Srivastava
International Conference on Computational Linguistics, COLING, 2020
@inproceedings{bib_Crea_2020, AUTHOR = {APPIDI ABHINAV REDDY, SRIRANGAM VAMSHI KRISHNA, DARSI SUHAS, Manish Srivastava}, TITLE = {Creation of Corpus and analysis in Code-Mixed Kannada-English Twitter data for Emotion Prediction}, BOOKTITLE = {International Conference on Computational Linguistics}, YEAR = {2020}}
Emotion prediction is a critical task in the field of Natural Language Processing (NLP). A significant amount of work has been done on emotion prediction for resource-rich languages, and there has been work on code-mixed social media corpora, but not on emotion prediction for Kannada-English code-mixed Twitter data. In this paper, we analyze the problem of emotion prediction on a corpus of code-mixed Kannada-English tweets extracted from Twitter and annotated with the respective 'Emotion' of each tweet. We experimented with machine learning prediction models using features like character n-grams, word n-grams, repetitive characters, and others, with SVM and LSTM models on our corpus, which resulted in accuracies of 30% and 32% respectively.
Tag2risk: Harnessing social music tags for characterizing depression risk
Aayush Surana,Yash Goyal,Manish Srivastava,Suvi Saarikallio,Vinoo A R
International Society for Music Information Retrieval, ISMIR, 2020
@inproceedings{bib_Tag2_2020, AUTHOR = {Aayush Surana, Yash Goyal, Manish Srivastava, Suvi Saarikallio, Vinoo A R}, TITLE = {Tag2risk: Harnessing social music tags for characterizing depression risk}, BOOKTITLE = {International Society for Music Information Retrieval}, YEAR = {2020}}
Musical preferences have been considered a mirror of the self. In this age of Big Data, online music streaming services allow us to capture ecologically valid music listening behavior and provide a rich source of information to identify several user-specific aspects. Studies have shown musical engagement to be an indirect representation of internal states, including internalized symptomatology and depression. The current study aims at unearthing patterns and trends in individuals at risk for depression as they manifest in naturally occurring music listening behavior. Mental well-being scores, musical engagement measures, and listening histories of users (N = 541) of an online music streaming service were acquired. Social tags associated with each listener's most popular tracks were analyzed to unearth the mood/emotions and genres associated with the users. Results revealed that social tags prevalent among the users at risk for depression were predominantly related to emotions depicting Sadness, associated with genre tags representing neo-psychedelic, avant-garde, and dream-pop. This study will open up avenues for an MIR-based approach to characterizing and predicting risk for depression, which can be helpful in early detection and can additionally provide a basis for designing music recommendations accordingly.
AVADHAN: System for Open-Domain Telugu Question Answering
Ravva Priyanka,Ashok Urlana,Manish Srivastava
India Joint International Conference on Data Science & Management of Data, COMAD/CODS, 2020
@inproceedings{bib_AVAD_2020, AUTHOR = {Ravva Priyanka, Ashok Urlana, Manish Srivastava}, TITLE = {AVADHAN: System for Open-Domain Telugu Question Answering}, BOOKTITLE = {India Joint International Conference on Data Science \& Management of Data}, YEAR = {2020}}
This paper presents AVADHAN, a Question Answering (QA) system for Telugu, a low-resource language. This work started with preparing a pre-tagged dataset for Telugu Question Classification (QC). We also explain the ambiguities and complexities involved in the dataset. AVADHAN compares Support Vector Machine (SVM), Logistic Regression (LR), and Multi-Layer Perceptron (MLP) classifiers for obtaining plausible answers. After performing various experiments, the overall accuracies obtained for the 'exact match' and 'partial match' based approaches were (31.6%, 68.5%) for SVM, (31%, 66.6%) for LR, and (30%, 67%) for MLP, respectively.
Principle-to-Program: Neural Methods for Similar Question Retrieval in Online Communities
Muthusamy Chelliah,Manish Srivastava,Jaidam Ram Tej
European Conference on Information Retrieval, ECIR, 2020
@inproceedings{bib_Prin_2020, AUTHOR = {Muthusamy Chelliah, Manish Srivastava, Jaidam Ram Tej}, TITLE = {Principle-to-Program: Neural Methods for Similar Question Retrieval in Online Communities}, BOOKTITLE = {European Conference on Information Retrieval}, YEAR = {2020}}
Similar question retrieval is a challenge due to the lexical gap between a query and the candidates in an archive, and is very different from traditional IR methods for duplicate detection, paraphrase identification, and semantic equivalence. This tutorial covers recent deep learning techniques which overcome the feature engineering issues of existing approaches based on translation models and latent topics. This hands-on tutorial thus introduces each concept from end-user (e.g., question-answer pairs) and technique (e.g., attention) perspectives, presents state-of-the-art methods, and includes a walkthrough of programs executed in Jupyter notebooks on real-world datasets demonstrating the principles introduced.
A Multi-Dimensional View of Aggression when voicing Opinion
ARJIT SRIVASTAVA,AVIJIT VAJPAYEE, Syed Sarfaraz Akhtar,Naman Jain,VINAY KUMAR SINGH,Manish Srivastava
International Conference on Language Resources and Evaluation, LREC, 2020
@inproceedings{bib_A_Mu_2020, AUTHOR = {ARJIT SRIVASTAVA, AVIJIT VAJPAYEE, Syed Sarfaraz Akhtar, Naman Jain, VINAY KUMAR SINGH, Manish Srivastava}, TITLE = {A Multi-Dimensional View of Aggression when voicing Opinion}, BOOKTITLE = {International Conference on Language Resources and Evaluation}, YEAR = {2020}}
The advent of social media has immensely proliferated the amount of opinions and arguments voiced on the internet. These virtual debates often present cases of aggression. While research has largely focused on analyzing aggression and stance in isolation from each other, this work is the first attempt to gain an extensive and fine-grained understanding of patterns of aggression and figurative language use when voicing opinion. We present a Hindi-English code-mixed dataset of opinions on the politico-social issue of the 2016 India banknote demonetisation and annotate it across multiple dimensions such as aggression, hate speech, emotion arousal, and figurative language usage (such as sarcasm/irony, metaphors/similes, and puns/word-play).
NoEl: An Annotated Corpus for Noun Ellipsis in English
PAYAL KULLAR,Majmundar Kushal Alpeshkumar,Manish Srivastava
International Conference on Language Resources and Evaluation, LREC, 2020
@inproceedings{bib_NoEl_2020, AUTHOR = {PAYAL KULLAR, Majmundar Kushal Alpeshkumar, Manish Srivastava}, TITLE = {NoEl: An Annotated Corpus for Noun Ellipsis in English}, BOOKTITLE = {International Conference on Language Resources and Evaluation}, YEAR = {2020}}
Ellipsis resolution has been identified as an important step to improve the accuracy of mainstream Natural Language Processing (NLP) tasks such as information retrieval, event extraction, dialog systems, etc. Previous computational work on ellipsis resolution has focused on one type of ellipsis, namely Verb Phrase Ellipsis (VPE), and a few other related phenomena. We extend the study of ellipsis by presenting the No(oun)El(lipsis) corpus, an annotated corpus for noun ellipsis and closely related phenomena using the first hundred movies of the Cornell Movie Dialogs Dataset. The annotations are carried out in a standoff annotation scheme that encodes the position of the licensor, the antecedent boundary, and Part-of-Speech (POS) tags of the licensor and antecedent modifier. Our corpus has 946 instances of exophoric and endophoric noun ellipsis, making it, to the best of our knowledge, the biggest resource of noun ellipsis in English. We present a statistical study of our corpus with novel insights on the distribution of noun ellipsis, its licensors, and its antecedents. Finally, we perform the tasks of detection and resolution of noun ellipsis with different classifiers trained on our corpus and report baseline results.
A Fully Expanded Dependency Treebank for Telugu
SNEHA NALLANI,Manish Srivastava,Dipti Mishra Sharma
International Conference on Language Resources and Evaluation, LREC, 2020
@inproceedings{bib_A_Fu_2020, AUTHOR = {SNEHA NALLANI, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {A Fully Expanded Dependency Treebank for Telugu}, BOOKTITLE = {International Conference on Language Resources and Evaluation}, YEAR = {2020}}
Treebanks are an essential resource for syntactic parsing. The available Paninian dependency treebanks for Telugu are annotated only with inter-chunk dependency relations, and not all words of a sentence are part of the parse tree. In this paper, we automatically annotate the intra-chunk dependencies in the treebank using a Shift-Reduce parser based on Context Free Grammar rules for Telugu chunks. We also propose a few additional intra-chunk dependency relations for Telugu apart from the ones used in the Hindi treebank. Annotating intra-chunk dependencies finally provides a complete parse tree for every sentence in the treebank. Having a fully expanded treebank is crucial for developing end-to-end parsers which produce complete trees. We present a fully expanded dependency treebank for Telugu consisting of 3,220 sentences. In this paper, we also convert the treebank annotated with the Anncorra part-of-speech tagset to the latest BIS tagset. The BIS tagset is a hierarchical tagset adopted as a unified part-of-speech standard across all Indian languages. The final treebank is made publicly available.
Hindi TimeBank: An ISO-TimeML Annotated Reference Corpus
Pranav Goel,Suhan Prabhu K,Alok Debnath,Priyank Modi,Manish Srivastava
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2020
@inproceedings{bib_Hind_2020, AUTHOR = {Pranav Goel, Suhan Prabhu K, Alok Debnath, Priyank Modi, Manish Srivastava}, TITLE = {Hindi TimeBank: An ISO-TimeML Annotated Reference Corpus}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}, YEAR = {2020}}
ISO-TimeML is an international standard for multilingual event annotation, detection, categorization and linking. In this paper, we present the Hindi TimeBank, an ISO-TimeML annotated reference corpus for the detection and classification of events, states and time expressions, and the links between them. Based on contemporary developments in Hindi event recognition, we propose language-independent and language-specific deviations from the ISO-TimeML guidelines, but preserve the schema. These deviations include the inclusion of annotator confidence and an independent mechanism for identifying and annotating states (such as copulars and existentials). With this paper, we present an open-source corpus, the Hindi TimeBank. The Hindi TimeBank is a 1,000-article dataset, with over 25,000 events, 3,500 states and 2,000 time expressions. We analyze the dataset in detail and provide a class-wise distribution of events, states and time expressions. Our guidelines and dataset are backed by high average inter-annotator agreement scores.
Detection and Annotation of Events in Kannada
Suhan Prabhu K,Ujwal Narayan N,Alok Debnath,Sumukh S,Manish Srivastava
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2020
@inproceedings{bib_Dete_2020, AUTHOR = {Suhan Prabhu K, Ujwal Narayan N, Alok Debnath, Sumukh S, Manish Srivastava}, TITLE = {Detection and Annotation of Events in Kannada}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2020}}
In this paper, we provide the basic guidelines towards the detection and linguistic analysis of events in Kannada. Kannada is a morphologically rich, resource-poor Dravidian language spoken in southern India. As most information retrieval and extraction tasks are resource intensive, very little work has been done on Kannada NLP, with almost no efforts in discourse analysis and dataset creation for representing events or other semantic annotations in text. In this paper, we linguistically analyze what constitutes an event in this language, and the challenges faced with discourse-level annotation and representation due to the rich derivational morphology of the language, which allows free word order, numerous multi-word expressions, adverbial participle constructions and constraints on subject-verb relations. This paper is therefore one of the first attempts at large-scale discourse-level annotation for Kannada, which can be used for semantic annotation and corpus development for other tasks in the language.
ConfNet2Seq: Full Length Answer Generation from Spoken Questions
VAISHALI PAL,Manish Srivastava,Laurent Besacier
Speech and Dialogue Conference, TSD, 2020
@inproceedings{bib_Conf_2020, AUTHOR = {VAISHALI PAL, Manish Srivastava, Laurent Besacier}, TITLE = {ConfNet2Seq: Full Length Answer Generation from Spoken Questions}, BOOKTITLE = {Speech and Dialogue Conference}. YEAR = {2020}}
Conversational and task-oriented dialogue systems aim to interact with the user using natural responses through multi-modal interfaces, such as text or speech. These desired responses are in the form of full-length natural answers generated over facts retrieved from a knowledge source. While the task of generating natural answers to questions from an answer span has been widely studied, there has been little research on natural sentence generation over spoken content. We propose a novel system to generate full-length natural language answers from spoken questions and factoid answers. The spoken sequence is compactly represented as a confusion network extracted from a pre-trained Automatic Speech Recognizer. To the best of our knowledge, this is the first attempt at generating full-length natural answers from a graph input (confusion network). We release a large-scale dataset of 259,788 samples of spoken questions, their factoid answers and corresponding full-length textual answers. Following our proposed approach, we achieve performance comparable to that of the best ASR hypothesis.
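A confusion network (a "sausage" lattice from an ASR system) can be stored as a list of bins, each holding alternative tokens with posterior probabilities. The sketch below illustrates only the data structure and a 1-best readout; the words and probabilities are invented, and the paper's actual model consumes the whole network, not just the best path.

```python
# Each bin lists competing tokens with made-up posterior probabilities.
confnet = [
    [("what", 0.8), ("watt", 0.2)],
    [("is", 1.0)],
    [("the", 0.6), ("a", 0.4)],
    [("capital", 0.9), ("capitol", 0.1)],
]

def best_hypothesis(cn):
    """Greedy 1-best path: pick the highest-probability token per bin."""
    return " ".join(max(bin_, key=lambda tp: tp[1])[0] for bin_ in cn)

print(best_hypothesis(confnet))  # what is the capital
```

Keeping the full bins, rather than collapsing to this 1-best string, is what lets the generator recover from ASR errors.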
SCAR: Sentence Compression using Autoencoders for Reconstruction
Maniar Tirth Anup,MALIREDDY CHANAKYA,Manish Srivastava
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2020
@inproceedings{bib_SCAR_2020, AUTHOR = {Maniar Tirth Anup, MALIREDDY CHANAKYA, Manish Srivastava}, TITLE = {SCAR: Sentence Compression using Autoencoders for Reconstruction}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2020}}
Sentence compression is the task of shortening a sentence while retaining its meaning. Most methods proposed for this task rely on labeled or paired corpora (containing pairs of verbose and compressed sentences), which are often expensive to collect. To overcome this limitation, we present a novel unsupervised deep learning framework (SCAR) for deletion-based sentence compression. SCAR is primarily composed of two encoder-decoder pairs: a compressor and a reconstructor. The compressor masks the input, and the reconstructor tries to regenerate it. The model is entirely trained on unlabeled data and does not require additional inputs such as explicit syntactic information or optimal compression length. SCAR’s merit lies in the novel Linkage Loss function, which correlates the compressor and its effect on reconstruction, guiding it to drop inferable tokens. SCAR achieves higher ROUGE scores on benchmark datasets than the existing state-of-the-art methods and baselines. We also conduct a user study to demonstrate the application of our model as a text highlighting system. Using our model to underscore salient information facilitates speed-reading and reduces the time required to skim a document.
A Simple and Effective Dependency Parser for Telugu
SNEHA NALLANI,Manish Srivastava,Dipti Mishra Sharma
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2020
@inproceedings{bib_A_Si_2020, AUTHOR = {SNEHA NALLANI, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {A Simple and Effective Dependency Parser for Telugu}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2020}}
We present a simple and effective dependency parser for Telugu, a morphologically rich, free word order language. We propose to replace the rich linguistic feature templates used in the past approaches with a minimal feature function using contextual vector representations. We train a BERT model on the Telugu Wikipedia data and use vector representations from this model to train the parser. Each sentence token is associated with a vector representing the token in the context of that sentence and the feature vectors are constructed by concatenating two token representations from the stack and one from the buffer. We put the feature representations through a feed forward network and train with a greedy transition based approach. The resulting parser has a very simple architecture with minimal feature engineering and achieves state-of-the-art results for Telugu.
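The minimal feature function described above (two stack tokens plus one buffer token, concatenated) can be sketched directly. The toy embedding table below stands in for the contextual BERT vectors; dimensions and values are illustrative only.

```python
import numpy as np

def feature_vector(stack, buffer, embeddings, dim=4):
    """Concatenate vectors for the top two stack tokens and the front
    buffer token, padding empty positions with zeros.  The result is
    what a feed-forward network would score for the next transition.
    """
    def emb(tok):
        return embeddings.get(tok, np.zeros(dim))
    s1 = emb(stack[-1]) if stack else np.zeros(dim)
    s2 = emb(stack[-2]) if len(stack) > 1 else np.zeros(dim)
    b1 = emb(buffer[0]) if buffer else np.zeros(dim)
    return np.concatenate([s2, s1, b1])

# Stand-in "contextual" vectors; real ones would come from BERT.
E = {"w1": np.ones(4), "w2": 2 * np.ones(4)}
print(feature_vector(["w1"], ["w2"], E).shape)  # (12,)
```

With only three token positions and no hand-built templates, the feature engineering really is minimal, which is the point of the paper's design.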
Word Embeddings as Tuples of Feature Probabilities
Siddharth Bhat M,Alok Debnath,Souvik Banerjee,Manish Srivastava
International Joint Conference on Natural Language Processing Workshop, IJCNLP-W, 2020
@inproceedings{bib_Word_2020, AUTHOR = {Siddharth Bhat M, Alok Debnath, Souvik Banerjee, Manish Srivastava}, TITLE = {Word Embeddings as Tuples of Feature Probabilities}, BOOKTITLE = {International Joint Conference on Natural Language Processing Workshop}. YEAR = {2020}}
In this paper, we provide an alternate perspective on word representations, by reinterpreting the dimensions of the vector space of a word embedding as a collection of features. In this reinterpretation, every component of the word vector is normalized against all the word vectors in the vocabulary. This idea now allows us to view each vector as an n-tuple (akin to a fuzzy set), where n is the dimensionality of the word representation and each element represents the probability of the word possessing a feature. Indeed, this representation enables the use of fuzzy set-theoretic operations, such as union, intersection and difference. Unlike previous attempts, we show that this representation of words provides a notion of similarity which is inherently asymmetric and hence closer to human similarity judgements. We compare the performance of this representation with various benchmarks, and explore some of the unique properties including function word detection, detection of polysemous words, and some insight into the interpretability provided by set theoretic operations.
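The normalization and fuzzy-set view above can be sketched in a few lines. The vocabulary matrix below is a toy stand-in, and the asymmetric "inclusion" score is one natural choice of asymmetric similarity under this representation, not necessarily the paper's exact formula.

```python
import numpy as np

# Toy vocabulary matrix: rows = words, columns = latent features.
V = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 2.0]])
# Normalising each column against the whole vocabulary turns entry
# (w, f) into "probability that word w possesses feature f".
P = V / V.sum(axis=0)

def f_union(a, b):
    return np.maximum(a, b)      # fuzzy set union

def f_intersection(a, b):
    return np.minimum(a, b)      # fuzzy set intersection

def inclusion(a, b):
    """Asymmetric similarity: the degree to which a is contained in b."""
    return f_intersection(a, b).sum() / a.sum()

print(inclusion(P[0], P[2]), inclusion(P[2], P[0]))  # note: asymmetric
```

The asymmetry (`inclusion(a, b) != inclusion(b, a)` in general) is exactly the property the abstract argues brings the similarity closer to human judgements.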
Semantic Textual Similarity of Sentences with Emojis
Alok Debnath,Nikhil Pinnaparaju,Manish Srivastava,Vasudeva Varma Kalidindi,Isabelle Augenstein
International Conference on World Wide Web, WWW, 2020
@inproceedings{bib_Sema_2020, AUTHOR = {Alok Debnath, Nikhil Pinnaparaju, Manish Srivastava, Vasudeva Varma Kalidindi, Isabelle Augenstein}, TITLE = {Semantic Textual Similarity of Sentences with Emojis}, BOOKTITLE = {International Conference on World Wide Web}. YEAR = {2020}}
In this paper, we extend the task of semantic textual similarity to include sentences which contain emojis. Emojis are ubiquitous on social media today, but are often removed in the pre-processing stage of curating datasets for NLP tasks. In this paper, we qualitatively ascertain the amount of semantic information lost by discounting emojis, as well as show a mechanism of accounting for emojis in a semantic task. We create a sentence similarity dataset of 4000 pairs of tweets with emojis, which have been annotated for relatedness. The corpus contains tweets curated based on common topic as well as by replacement of emojis. The latter was done to analyze the difference in semantics associated with different emojis. We aim to provide an understanding of the information lost by removing emojis by providing a qualitative analysis of the dataset. We also aim to present a method of using both emojis and words for downstream NLP tasks beyond sentiment analysis.
Fermi at SemEval-2019 Task 6: Identifying and categorizing offensive language in social media using sentence embeddings
I VIJAYASARADHI,Bakhtiyar Hussain Syed,Manish Srivastava,Manish Gupta,Vasudeva Varma Kalidindi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Ferm_2019, AUTHOR = {I VIJAYASARADHI, Bakhtiyar Hussain Syed, Manish Srivastava, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Fermi at SemEval-2019 Task 6: Identifying and categorizing offensive language in social media using sentence embeddings}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
This paper describes our system (Fermi) for Task 6: OffensEval: Identifying and Categorizing Offensive Language in Social Media of SemEval-2019. We participated in all three sub-tasks within Task 6. We evaluate multiple sentence embeddings in conjunction with various supervised machine learning algorithms and assess the performance of simple yet effective embedding-ML combination algorithms. Our team Fermi’s model achieved an F1-score of 64.40%, 62.00% and 62.60% for sub-tasks A, B and C respectively on the official leaderboard. Our model for sub-task C, which uses pre-trained ELMo embeddings for transforming the input and an SVM (RBF kernel) for training, scored third position on the official leaderboard. In this paper, we provide a detailed description of the approach, as well as the results obtained for the task.
Fermi at semeval-2019 task 5: Using sentence embeddings to identify hate speech against immigrants and women in twitter
I VIJAYASARADHI,Bakhtiyar Hussain Syed,Manish Srivastava,Nikhil Chakravartula,Manish Gupta,Vasudeva Varma Kalidindi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Ferm_2019, AUTHOR = {I VIJAYASARADHI, Bakhtiyar Hussain Syed, Manish Srivastava, Nikhil Chakravartula, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Fermi at semeval-2019 task 5: Using sentence embeddings to identify hate speech against immigrants and women in twitter}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
This paper describes our system (Fermi) for Task 5 of SemEval-2019: HatEval: Multilingual Detection of Hate Speech Against Immigrants and Women on Twitter. We participated in sub-task A for English and ranked first in the evaluation on the test set. We evaluate the quality of multiple sentence embeddings and explore multiple training models to assess the performance of simple yet effective embedding-ML combination algorithms. Our team Fermi’s model achieved an accuracy of 65.00% for the English language in task A. Our models, which use pretrained Universal Sentence Encoder embeddings for transforming the input and an SVM (with RBF kernel) for classification, scored first position (among 68) on the test-set leaderboard for sub-task A in English. In this paper, we provide a detailed description of the approach, as well as the results obtained in the task.
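The RBF kernel at the heart of the SVM classifiers used in these Fermi systems is easy to compute over sentence-embedding rows; the `gamma` value and the two-point example below are arbitrary illustrations, not the system's tuned settings.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """K[i, j] = exp(-gamma * ||x_i - y_j||^2) between two sets of
    sentence embeddings (one embedding per row)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0],
              [1.0, 0.0]])
K = rbf_kernel(X, X)
# Each point has similarity 1.0 with itself; similarity decays
# smoothly with squared Euclidean distance.
```

Because the kernel depends only on distances between embeddings, the same pipeline works unchanged whichever sentence encoder produces the vectors.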
Fermi at SemEval-2019 task 8: An elementary but effective approach to question discernment in community qa forums
Bakhtiyar Hussain Syed,I VIJAYASARADHI,Manish Srivastava,Manish Gupta,Vasudeva Varma Kalidindi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Ferm_2019, AUTHOR = {Bakhtiyar Hussain Syed, I VIJAYASARADHI, Manish Srivastava, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Fermi at SemEval-2019 task 8: An elementary but effective approach to question discernment in community qa forums}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
Online Community Question Answering (cQA) forums have gained massive popularity in recent years. The rise in users of such forums has led to an increased need for automated evaluation of question comprehension and fact evaluation of the answers provided by various participants in the forum. Our team, Fermi, participated in sub-task A of Task 8 at SemEval 2019, which tackles the first problem in the pipeline of factual evaluation in cQA forums, i.e., deciding whether a posed question asks for factual information, an opinion/advice, or is just socializing. This information is highly useful in segregating factual questions from non-factual ones, which helps in organizing the questions into useful categories and trims down the problem space for the next task in the pipeline, fact evaluation among the available answers. Our system uses embeddings obtained from the Universal Sentence Encoder combined with XGBoost for classification in sub-task A. We also evaluate other combinations of embeddings and off-the-shelf machine learning algorithms to demonstrate the efficacy of the various representations and their combinations. Our results on the evaluation test set gave an accuracy of 84% and received the first position in the final standings judged by the organizers.
Using argumentative semantic feature for summarization
SAKALA VENKATA KRISHNA ROHIT,Manish Srivastava
International Computer Science Conference, ICSC, 2019
@inproceedings{bib_Usin_2019, AUTHOR = {SAKALA VENKATA KRISHNA ROHIT, Manish Srivastava}, TITLE = {Using argumentative semantic feature for summarization}, BOOKTITLE = {International Computer Science Conference}. YEAR = {2019}}
The last decade has witnessed the digitization of many government organizations’ data. Text summarization of political discourse, particularly parliamentary proceedings, is a relatively less explored area of research. In this paper, we investigate the role of semantics, especially the theory of argumentation, in debate summarization and use it to design a semi-automatic pipeline for generating these summaries. The proposed approach considers topic-relevance, argumentative nature, sentiment and context features. We test our approach on a dataset of debates mined from the Lok Sabha, the elected house of representatives in India. Our proposed methodology and pipeline show significant improvement over high-performing popular systems on the ROUGE-1, ROUGE-2 and ROUGE-L metrics.
A3-108 Machine Translation System for LoResMT 2019
Saumitra Yadav,Vandan Mujadia,Manish Srivastava
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_A3-1_2019, AUTHOR = {Saumitra Yadav, Vandan Mujadia, Manish Srivastava}, TITLE = {A3-108 Machine Translation System for LoResMT 2019}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
In this paper, we describe our machine translation systems submitted to the LoResMT 2019 Shared Task. Systems were developed for Bhojpuri, Magahi, Sindhi and Latvian ⇐⇒ English. This paper outlines the preprocessing and configuration of the submitted systems and the results produced using the same.
A semantico-syntactic approach to event-mention detection and extraction in hindi
JAIPAL SINGH GOUD,Pranav Goel,Alok Debnath,Suhan Prabhu K,Manish Srivastava
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_A_se_2019, AUTHOR = {JAIPAL SINGH GOUD, Pranav Goel, Alok Debnath, Suhan Prabhu K, Manish Srivastava}, TITLE = {A semantico-syntactic approach to event-mention detection and extraction in hindi}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
This paper introduces a gold standard event-annotated dataset of 810 Hindi news articles as well as a set of comprehensive guidelines for detecting and annotating events in Hindi. We present our linguistically motivated guideline development process, with a focus on annotator friendliness, which can be replicated for event mention detection in most Indo-Aryan languages. The paper highlights the challenges of detecting event mentions in Hindi given the unique semantic constraints on the syntactic apparatus used for denoting events. Our work as a whole also establishes a language-agnostic pipeline for the development of event-annotated corpora and event detection guidelines.
A Pregroup Representation of Word Order Alternation using Hindi Syntax
Alok Debnath,Manish Srivastava
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_A_Pr_2019, AUTHOR = {Alok Debnath, Manish Srivastava}, TITLE = {A Pregroup Representation of Word Order Alternation using Hindi Syntax}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
Pregroup calculus has been used for the representation of free word order languages (Sanskrit and Hungarian), using a construction called precyclicity. However, restricted word order alternation has not been handled before. This paper aims at introducing and formally expressing three methods of representing word order alternation in the pregroup representation of any language. This paper describes the word order alternation patterns of Hindi, and creates a basic pregroup representation for the language. In doing so, the shortcoming of correct reductions for ungrammatical sentences due to the current apparatus is highlighted, and the aforementioned methods are invoked for a grammatically accurate representation of restricted word order alternation. The replicability of these methods is explained in the representation of adverbs and prepositional phrases in English.
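As a toy illustration of pregroup reductions (not the paper's Hindi type assignments), contractions can be checked mechanically. Below, a type is a list of (base, adjoint) pairs, with -1 marking a left adjoint x^l and +1 a right adjoint x^r; the greedy stack-based contraction suffices for simple examples like the English SVO sentence shown.

```python
def reduces_to(types, target=("s", 0)):
    """Greedy stack-based contraction of a string of pregroup types.
    Allowed contractions: (x, -1)(x, 0) -> 1 and (x, 0)(x, +1) -> 1.
    Returns True if the whole string contracts to the target type."""
    stack = []
    for simple in (t for typ in types for t in typ):
        if stack:
            base, adj = stack[-1]
            b2, a2 = simple
            if base == b2 and (adj, a2) in ((-1, 0), (0, 1)):
                stack.pop()      # contraction fires
                continue
        stack.append(simple)
    return stack == [target]

noun = [("n", 0)]
trans_verb = [("n", 1), ("s", 0), ("n", -1)]   # n^r s n^l
print(reduces_to([noun, trans_verb, noun]))     # True: SVO reduces to s
```

Representing word order alternation then amounts to deciding which additional type assignments (or precyclic permutations) still allow such a reduction to the sentence type.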
Curriculum Learning Strategies for Hindi-English Codemixed Sentiment Analysis
ANIRUDH DAHIYA,NEERAJ BATTAN,Manish Srivastava,Dipti Mishra Sharma
International Joint Conference on Artificial Intelligence, IJCAI, 2019
@inproceedings{bib_Curr_2019, AUTHOR = {ANIRUDH DAHIYA, NEERAJ BATTAN, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Curriculum Learning Strategies for Hindi-English Codemixed Sentiment Analysis}, BOOKTITLE = {International Joint Conference on Artificial Intelligence}. YEAR = {2019}}
Sentiment analysis and other semantic tasks are commonly used for social media textual analysis to gauge public opinion and make sense of the noise on social media. The language used on social media not only commonly diverges from formal language, but is compounded by code-mixing between languages, especially in large multilingual societies like India. Traditional methods for learning semantic NLP tasks have long relied on end-to-end task-specific training, requiring an expensive data creation process, even more so for deep learning methods. This challenge is even more severe for resource-scarce texts like code-mixed language pairs, which lack well-learnt representations as model priors, and whose task-specific datasets can be too few and too small to efficiently exploit recent deep learning approaches. To address these challenges, we introduce curriculum learning strategies for semantic tasks on code-mixed Hindi-English (Hi-En) texts, and investigate various training strategies for enhancing model performance. Our method outperforms the state-of-the-art methods for Hi-En code-mixed sentiment analysis by 3.31% accuracy, and also shows better model robustness in terms of convergence and variance in test performance.
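The core of any curriculum strategy is a difficulty score used to order training examples from easy to hard. The sketch below uses the fraction of embedded-language tokens as one plausible difficulty signal for code-mixed text; the vocabulary and scoring choice are illustrative, not the paper's exact curriculum.

```python
def code_mixing_ratio(tokens, embedded_vocab):
    """Toy difficulty signal: fraction of tokens drawn from the
    embedded language (a stand-in for a real Hindi lexicon)."""
    return sum(t in embedded_vocab for t in tokens) / len(tokens)

def curriculum_order(dataset, embedded_vocab):
    """Present examples easy -> hard: mostly-monolingual sentences
    first, heavily code-mixed sentences last."""
    return sorted(dataset, key=lambda ex: code_mixing_ratio(ex, embedded_vocab))

hindi = {"bahut", "accha", "nahi"}
data = [["movie", "bahut", "accha", "nahi"], ["great", "movie"]]
print(curriculum_order(data, hindi))  # monolingual example first
```

A training loop would then feed batches in this order (optionally re-shuffling within difficulty bands), which is what gives curriculum methods their convergence benefits.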
Corpus Creation and Analysis for Named Entity Recognition in Telugu-English Code-Mixed Social Media Data
SRIRANGAM VAMSHI KRISHNA,APPIDI ABHINAV REDDY,VINAY KUMAR SINGH,Manish Srivastava
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Corp_2019, AUTHOR = {SRIRANGAM VAMSHI KRISHNA, APPIDI ABHINAV REDDY, VINAY KUMAR SINGH, Manish Srivastava}, TITLE = {Corpus Creation and Analysis for Named Entity Recognition in Telugu-English Code-Mixed Social Media Data}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
Named Entity Recognition (NER) is one of the important tasks in Natural Language Processing (NLP) and is also a subtask of Information Extraction. In this paper, we present our work on NER in Telugu-English code-mixed social media data. Code-mixing, a progeny of multilingualism, is a way in which multilingual people express themselves on social media by using linguistic units from different languages within a sentence or speech context. Entity extraction from social media data such as tweets (Twitter) is in general difficult due to its informal nature; code-mixed data further complicates the problem due to its informal, unstructured and incomplete information. We present a Telugu-English code-mixed corpus with the corresponding named entity tags. The named entities used to tag the data are Person (‘Per’), Organization (‘Org’) and Location (‘Loc’). We experimented with the machine learning models Conditional Random Fields (CRFs), Decision Trees and BiLSTMs on our corpus, which resulted in F1-scores of 0.96, 0.94 and 0.95, respectively.
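A CRF baseline of the kind mentioned above is typically driven by hand-crafted per-token features. The template below is a generic, illustrative feature set for code-mixed NER, not the paper's exact features.

```python
def token_features(tokens, i):
    """Hand-crafted emission features for token i, of the kind fed to
    a CRF for code-mixed NER."""
    w = tokens[i]
    return {
        "lower": w.lower(),
        "is_title": w.istitle(),          # capitalization cues Per/Org/Loc
        "has_digit": any(c.isdigit() for c in w),
        "suffix3": w[-3:],                # crude morphology signal
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

print(token_features(["Hyderabad", "lo", "traffic"], 0)["is_title"])  # True
```

Each sentence becomes a sequence of such feature dicts paired with Per/Org/Loc tags, which a CRF library can train on directly.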
De-Mixing Sentiment from Code-Mixed Text
Yash Kumar Lal,Vaibhav Kumar,MRINAL DHAR,Manish Srivastava,Philipp Koehn
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_De-M_2019, AUTHOR = {Yash Kumar Lal, Vaibhav Kumar, MRINAL DHAR, Manish Srivastava, Philipp Koehn}, TITLE = {De-Mixing Sentiment from Code-Mixed Text}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
Code-mixing is the phenomenon of mixing the vocabulary and syntax of multiple languages in the same sentence. It is an increasingly common occurrence in today’s multilingual society and poses a big challenge when encountered in different downstream tasks. In this paper, we present a hybrid architecture for the task of Sentiment Analysis of English-Hindi code-mixed data. Our method consists of three components, each seeking to alleviate different issues. We first generate subword level representations for the sentences using a CNN architecture. The generated representations are used as inputs to a Dual Encoder Network which consists of two different BiLSTMs - the Collective and Specific Encoder. The Collective Encoder captures the overall sentiment of the sentence, while the Specific Encoder utilizes an attention mechanism in order to focus on individual sentiment-bearing sub-words. This, combined with a Feature Network consisting of orthographic features and specially trained word embeddings, achieves state-of-the-art results - 83.54% accuracy and 0.827 F1 score - on a benchmark dataset.
Using Syntax to Resolve NPE in English
Manish Srivastava,PAYAL KULLAR,ALLEN JOJO ANTONY
Recent Advances in Natural Language Processing, RANLP, 2019
@inproceedings{bib_Usin_2019, AUTHOR = {Manish Srivastava, PAYAL KULLAR, ALLEN JOJO ANTONY}, TITLE = {Using Syntax to Resolve NPE in English}, BOOKTITLE = {Recent Advances in Natural Language Processing}. YEAR = {2019}}
This paper describes a novel, syntax-based system for automatic detection and resolution of Noun Phrase Ellipsis (NPE) in English. The system takes in free input English text, detects the site of nominal elision, and if present, selects potential antecedent candidates. The rules are built using the syntactic information on ellipsis and its antecedent discussed in previous theoretical linguistics literature on NPE. Additionally, we prepare a curated dataset of 337 sentences from well-known, reliable sources, containing positive and negative samples of NPE. We split this dataset into two parts, and use one part to refine our rules and the other to test the performance of our final system. We get an F1-score of 76.47% for detection and 70.27% for NPE resolution on the test set. To the best of our knowledge, ours is the first system that detects and resolves NPE in English. The curated dataset used for this task, albeit small, covers a wide variety of NPE cases and will be made public for future work.
Predicting Algorithm Classes for Programming Word Problems
ATHAVALE VINAYAK SANJAY,AAYUSH NAIK,VANJAPE RAJAS MANGESH,Manish Srivastava
Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 2019
@inproceedings{bib_Pred_2019, AUTHOR = {ATHAVALE VINAYAK SANJAY, AAYUSH NAIK, VANJAPE RAJAS MANGESH, Manish Srivastava}, TITLE = {Predicting Algorithm Classes for Programming Word Problems}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics}. YEAR = {2019}}
We introduce the task of algorithm class prediction for programming word problems. A programming word problem is a problem written in natural language, which can be solved using an algorithm or a program. We define classes of various programming word problems which correspond to the class of algorithms required to solve the problem. We present four new datasets for this task, two multiclass datasets with 550 and 1159 problems each and two multilabel datasets having 3737 and 3960 problems each. We pose the problem as a text classification problem and train neural network and non-neural network based models on this task. Our best performing classifier gets an accuracy of 62.7 percent for the multiclass case on the five class classification dataset, Codeforces Multiclass-5 (CFMC5). We also do some human-level analysis and compare human performance with that of our text classification models. Our best classifier has an accuracy only 9 percent lower than that of a human on this task. To the best of our knowledge, these are the first reported results on such a task. We make our code and datasets publicly available.
Answering Naturally: Factoid to Full length Answer Generation
VAISHALI PAL,Manish Srivastava,IRSHAD AHMAD BHAT
Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 2019
@inproceedings{bib_Answ_2019, AUTHOR = {VAISHALI PAL, Manish Srivastava, IRSHAD AHMAD BHAT}, TITLE = {Answering Naturally: Factoid to Full length Answer Generation}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics}. YEAR = {2019}}
In recent years, the task of question answering over passages, also pitched as reading comprehension, has evolved into a very active research area. A reading comprehension system extracts a span of text, comprising named entities, dates, small phrases, etc., which serves as the answer to a given question. However, these spans of text would result in an unnatural reading experience in a conversational system. Usually, dialogue systems solve this issue by using template-based language generation. These systems, though adequate for a domain-specific task, are too restrictive and predefined for a domain-independent system. In order to present the user with a more conversational experience, we propose a pointer-generator based full-length answer generator which can be used with most QA systems. Our system generates a full-length answer given a question and the extracted factoid/span answer, without relying on the passage from which the answer was extracted. We also present a dataset of 315,000 question, factoid answer and full-length answer triples. We have evaluated our system using ROUGE-1, ROUGE-2, ROUGE-L and BLEU, and achieved a 74.05 BLEU score and an 86.25 ROUGE-L score.
Transition-based deep input linearization
PUDUPPULLY RATISH SURENDRAN,Yue Zhang,Manish Srivastava
Conference of the European Chapter of the Association for Computational Linguistics (EACL), EACL, 2019
@inproceedings{bib_Tran_2019, AUTHOR = {PUDUPPULLY RATISH SURENDRAN, Yue Zhang, Manish Srivastava}, TITLE = {Transition-based deep input linearization}, BOOKTITLE = {Conference of the European Chapter of the Association for Computational Linguistics (EACL)}. YEAR = {2019}}
Traditional methods for deep NLG adopt pipeline approaches comprising stages such as constructing syntactic input, predicting function words, linearizing the syntactic input and generating the surface forms. Though easier to visualize, pipeline approaches suffer from error propagation. In addition, information available across modules cannot be leveraged by all modules. We construct a transition-based model to jointly perform linearization, function word prediction and morphological generation, which considerably improves upon the accuracy compared to a pipelined baseline system. On a standard deep input linearization shared task, our system achieves the best results reported so far.
Inductive Transfer Learning for Detection of Well-formed Natural Language Search Queries
Bakhtiyar Hussain Syed,I VIJAYASARADHI,Manish Gupta,Manish Srivastava,Vasudeva Varma Kalidindi
European Conference on Information Retrieval, ECIR, 2019
@inproceedings{bib_Indu_2019, AUTHOR = {Bakhtiyar Hussain Syed, I VIJAYASARADHI, Manish Gupta, Manish Srivastava, Vasudeva Varma Kalidindi}, TITLE = {Inductive Transfer Learning for Detection of Well-formed Natural Language Search Queries}, BOOKTITLE = {European Conference on Information Retrieval}. YEAR = {2019}}
Users have been trained to type keyword queries on search engines. However, recently there has been a significant rise in the number of verbose queries. Oftentimes such queries are not well-formed. The lack of well-formedness in the query might adversely impact the downstream pipeline which processes these queries. A well-formed natural language question as a search query aids heavily in reducing errors in downstream tasks and further helps in improved query understanding. In this paper, we employ an inductive transfer learning technique by fine-tuning a pretrained language model to identify whether a search query is a well-formed natural language question or not. We show that our model, trained on a recently released benchmark dataset spanning 25,100 queries, gives an accuracy of 75.03%, thereby improving by ∼5 absolute percentage points over the state-of-the-art.
Using Sentence Embeddings to identify Hate Speech against Immigrants and Women on Twitter
I VIJAYASARADHI,Bakhtiyar Hussain Syed,Manish Srivastava,Nikhil Chakravartula,Manish Gupta,Vasudeva Varma Kalidindi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Usin_2019, AUTHOR = {I VIJAYASARADHI, Bakhtiyar Hussain Syed, Manish Srivastava, Nikhil Chakravartula, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Using Sentence Embeddings to identify Hate Speech against Immigrants and Women on Twitter}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
This paper describes our system (Fermi) for Task 5 of SemEval-2019: HatEval: Multilingual Detection of Hate Speech Against Immigrants and Women on Twitter. We participated in Subtask A for English and ranked first in the evaluation on the test set. We evaluate the quality of multiple sentence embeddings and explore multiple training models to assess the performance of simple yet effective embedding-ML combination algorithms. Our team Fermi's model achieved an accuracy of 65.00% for the English language in Subtask A. Our models, which use pretrained Universal Sentence Encoder embeddings for transforming the input and an SVM (with RBF kernel) for classification, scored first position (among 68) on the leaderboard on the test set for Subtask A in English. In this paper we provide a detailed description of the approach, as well as the results obtained in the task.
Identifying and Categorizing Offensive Language in Social Media using Sentence Embeddings
I VIJAYASARADHI,Bakhtiyar Hussain Syed,Manish Srivastava,Manish Gupta,Vasudeva Varma Kalidindi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_Iden_2019, AUTHOR = {I VIJAYASARADHI, Bakhtiyar Hussain Syed, Manish Srivastava, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {Identifying and Categorizing Offensive Language in Social Media using Sentence Embeddings}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
This paper describes our system (Fermi) for Task 6 of SemEval-2019: OffensEval: Identifying and Categorizing Offensive Language in Social Media. We participated in all three sub-tasks within Task 6. We evaluate multiple sentence embeddings in conjunction with various supervised machine learning algorithms and assess the performance of simple yet effective embedding-ML combination algorithms. Our team Fermi's model achieved an F1-score of 64.40%, 62.00% and 62.60% for sub-tasks A, B and C respectively on the official leaderboard. Our model for sub-task C, which uses pretrained ELMo embeddings for transforming the input and an SVM (RBF kernel) for training, scored third position on the official leaderboard. In this paper we provide a detailed description of the approach, as well as the results obtained for the task.
An elementary but effective approach to Question Discernment in Community QA Forums.
Bakhtiyar Hussain Syed,I VIJAYASARADHI,Manish Srivastava,Manish Gupta,Vasudeva Varma Kalidindi
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_An_e_2019, AUTHOR = {Bakhtiyar Hussain Syed, I VIJAYASARADHI, Manish Srivastava, Manish Gupta, Vasudeva Varma Kalidindi}, TITLE = {An elementary but effective approach to Question Discernment in Community QA Forums.}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
Online Community Question Answering (cQA) forums have gained massive popularity in recent years. The rise in users of such forums has increased the need for automated evaluation of question comprehension and of the factuality of answers provided by participants. Our team, Fermi, participated in sub-task A of Task 8 at SemEval 2019, which tackles the first problem in the pipeline of factual evaluation in cQA forums: deciding whether a posed question asks for factual information, asks for an opinion/advice, or is just socializing. This information is highly useful in segregating factual questions from non-factual ones, which helps in organizing the questions into useful categories and trims down the problem space for the next task in the pipeline, fact evaluation among the available answers. Our system uses embeddings obtained from the Universal Sentence Encoder combined with XGBoost for the classification in sub-task A. We also evaluate other combinations of embeddings and off-the-shelf machine learning algorithms to demonstrate the efficacy of the various representations and their combinations. Our results on the evaluation test set gave an accuracy of 84%, and we received first position in the final standings judged by the organizers.
Automatic Normalization of Word Variations in Code-Mixed Social Media Text
RAJAT SINGH,NURENDRA CHOUDHARY,Manish Shrivastava
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2018
@inproceedings{bib_Auto_2018, AUTHOR = {RAJAT SINGH, NURENDRA CHOUDHARY, Manish Shrivastava}, TITLE = {Automatic Normalization of Word Variations in Code-Mixed Social Media Text}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2018}}
Social media platforms such as Twitter and Facebook are becoming popular in multilingual societies. This trend induces a blending of South Asian languages with English. The mix of multiple languages as code-mixed data has recently become popular in research communities for various NLP tasks. Code-mixed data contain anomalies such as grammatical errors and spelling variations. In this paper, we leverage the contextual property of words, whereby different spelling variations of a word share a similar context in large, noisy social media text. We capture different variations of words belonging to the same context in an unsupervised manner using distributed representations of words. Our experiments reveal that preprocessing the code-mixed dataset with our approach improves performance on state-of-the-art part-of-speech tagging (POS-tagging) and sentiment analysis tasks.
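A minimal sketch of this contextual-normalization idea: variants that share a context end up with nearby distributed representations, so a variant can be mapped to its closest canonical spelling. All words, vectors and the threshold below are illustrative toys, not trained embeddings or the paper's actual pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embeddings: spelling variants of "pyaar" share context, hence nearby vectors.
emb = {
    "pyaar":   [0.90, 0.10, 0.00],
    "pyar":    [0.88, 0.12, 0.02],
    "piyar":   [0.85, 0.15, 0.05],
    "cricket": [0.00, 0.20, 0.95],
}

def normalize(word, canon=("pyaar", "cricket"), threshold=0.95):
    """Map a word to the closest canonical form if it is similar enough."""
    best = max(canon, key=lambda c: cosine(emb[word], emb[c]))
    return best if cosine(emb[word], emb[best]) >= threshold else word

print(normalize("pyar"))  # collapses the variant onto its canonical spelling
```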
Contrastive Learning of Emoji-Based Representations for Resource-Poor Languages
NURENDRA CHOUDHARY,RAJAT SINGH,Ishita Bindlish,Manish Shrivastava
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2018
@inproceedings{bib_Cont_2018, AUTHOR = {NURENDRA CHOUDHARY, RAJAT SINGH, Ishita Bindlish, Manish Shrivastava}, TITLE = {Contrastive Learning of Emoji-Based Representations for Resource-Poor Languages}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2018}}
The CESNA model consists of twin Bi-directional Long Short-Term Memory Recurrent Neural Networks (Bi-LSTM RNNs) with shared parameters, joined by a contrastive loss function based on a similarity metric. The model learns representations of a resource-poor and a resource-rich language in a common emoji space by using a similarity metric based on the emojis present in sentences from both languages. The model, hence, projects sentences with similar emojis closer to each other and sentences with different emojis farther from one another. Experiments on large-scale Twitter datasets of resource-rich languages (English and Spanish) and resource-poor languages (Hindi and Telugu) reveal that CESNA outperforms the state-of-the-art emoji prediction approaches based on distributional semantics.
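The contrastive objective described in the abstract can be written down directly. A minimal sketch, assuming the distance `d` between the two Bi-LSTM outputs has already been computed; the margin value is illustrative:

```python
def contrastive_loss(d, same_emoji, margin=1.0):
    """Contrastive loss over a distance d between twin-network outputs.

    same_emoji=1 pulls similar pairs together (loss grows with d);
    same_emoji=0 pushes dissimilar pairs at least `margin` apart.
    """
    if same_emoji:
        return d * d
    return max(0.0, margin - d) ** 2

# A similar pair that is far apart is penalized heavily;
# a dissimilar pair already beyond the margin costs nothing.
print(contrastive_loss(0.9, 1))  # large: pair should be closer
print(contrastive_loss(1.2, 0))  # 0.0: pair is already past the margin
```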
Poster: DWEN: Deep Word Embedding Network for Duplicate Bug Report Detection in Software Repositories
Amar Budhiraja,KARTIK DUTTA,Raghu Babu Reddy Y,Manish Shrivastava
International Conference on Software Engineering - Companion, ICSE - Companion, 2018
@inproceedings{bib_Post_2018, AUTHOR = {Amar Budhiraja, KARTIK DUTTA, Raghu Babu Reddy Y, Manish Shrivastava}, TITLE = {Poster: DWEN: Deep Word Embedding Network for Duplicate Bug Report Detection in Software Repositories}, BOOKTITLE = {International Conference on Software Engineering - Companion}. YEAR = {2018}}
Bug report filing is a major part of software maintenance. Due to the asynchronous nature of the bug filing process, duplicate bug reports are filed. Detecting duplicate bug reports is an important aspect of software maintenance, since the same bug should not be assigned to different developers. In this poster, we present the Deep Word Embedding Network for computing similarity between two bug reports for the task of duplicate bug report detection. We propose to learn a two-step model that calculates similarity between two bug reports by means of word embeddings and a deep neural network. We run experiments on two large datasets from the Mozilla and Open Office projects and compare the proposed approach with baselines and related approaches. Through this initial work, we show that a combination of word embeddings and deep neural networks can improve duplicate bug report detection.
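The first step of the two-step idea — representing each report by aggregated word vectors and scoring pairs — can be sketched as below. The embeddings are toy values, and cosine similarity stands in for the deep scoring network of the poster:

```python
import math

def mean_vector(words, emb):
    """Represent a bug report as the mean of its known word vectors."""
    vecs = [emb[w] for w in words if w in emb]
    dim = len(next(iter(emb.values())))
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

emb = {  # toy word embeddings
    "crash": [1.0, 0.0], "segfault": [0.9, 0.1],
    "font":  [0.0, 1.0], "render":   [0.1, 0.9],
}

r1 = mean_vector("crash on segfault".split(), emb)
r2 = mean_vector("segfault then crash".split(), emb)
r3 = mean_vector("font render issue".split(), emb)

# Reports describing the same bug score far higher than unrelated ones.
print(cosine(r1, r2) > cosine(r1, r3))  # True
```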
RARE : A Recurrent Attentive Recommendation Engine for News Aggregators
DHRUV KHATTAR,VAIBHAV KUMAR,SHASHANK GUPTA,Manish Shrivastava,Vasudeva Varma Kalidindi
International Conference on Information and Knowledge Management, CIKM, 2018
@inproceedings{bib_RARE_2018, AUTHOR = {DHRUV KHATTAR, VAIBHAV KUMAR, SHASHANK GUPTA, Manish Shrivastava, Vasudeva Varma Kalidindi}, TITLE = {RARE : A Recurrent Attentive Recommendation Engine for News Aggregators}, BOOKTITLE = {International Conference on Information and Knowledge Management}. YEAR = {2018}}
With news stories coming from a variety of sources, it is crucial for news aggregators to present interesting articles to the user to maximize their engagement. This creates the need for a news recommendation system which understands the content of the articles as well as accounts for the users' preferences. Methods such as Collaborative Filtering, which are well known for general recommendations, are not suitable for news because of the short life span of articles and because of the large number of articles published each day. Apart from this, such methods do not harness the information present in the sequence in which the articles are read by the user and hence are unable to account for the specific and generic interests of the user, which may keep changing with time. In order to address these issues for news recommendation, we propose the Recurrent Attentive Recommendation Engine (RARE). RARE consists of two components and utilizes the distributed representations of news articles. The first component is used to model the user's sequential behaviour of news reading in order to understand her general interests, i.e., to get a summary of her interests. The second component utilizes an article-level attention mechanism to understand her specific interests. We feed the information obtained from both components to a Siamese Network in order to make predictions.
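The article-level attention component can be sketched as a softmax over interest scores; the vectors below are illustrative toys, and `attentive_summary` is a hypothetical stand-in for the second RARE component:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attentive_summary(history, interest):
    """Weight each read article by its match with the user's interest
    vector, then return the attention-weighted summary of the history."""
    scores = [sum(a * q for a, q in zip(art, interest)) for art in history]
    weights = softmax(scores)
    dim = len(history[0])
    return [sum(w * art[d] for w, art in zip(weights, history)) for d in range(dim)]

history = [[1.0, 0.0], [0.0, 1.0]]  # two read articles (toy vectors)
summary = attentive_summary(history, interest=[1.0, 0.0])

# The summary leans towards the article matching the interest vector.
print(summary[0] > summary[1])  # True
```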
Real-time Indoor Theft Detection System Using Computer-Vision
Manish Srivastava,Princy Matlani
International Journal of Scientific Research in Computer Science, Engineering and Information Techno, IJSRCSEIT, 2018
@inproceedings{bib_Real_2018, AUTHOR = {Manish Srivastava, Princy Matlani}, TITLE = {Real-time Indoor Theft Detection System Using Computer-Vision}, BOOKTITLE = {International Journal of Scientific Research in Computer Science, Engineering and Information Techno}. YEAR = {2018}}
Real-time indoor theft detection from surveillance videos is not only a challenging problem of object detection and human activity recognition in the field of computer vision, but also an urgent need for preventing theft in real life. The system uses digital cameras to scan the faces of people approaching a security gate, i.e. the entry gate, matches them against the faces in a database, and automatically generates an alert if a face is not recognized. In this paper, we propose a framework for real-time indoor theft detection that combines the results of face recognition and pattern matching, analyzing observed activities against those of thieves. Finally, if abnormal activity is detected, the system automatically sends a message via Multimedia Message Service with the help of a GPRS/GSM modem.
DWEN: deep word embedding network for duplicate bug report detection in software repositories
AMAR BUDHIRAJA,KARTIK DUTTA,Raghu Babu Reddy Y,Manish Srivastava
International Conference on Software Engineering, ICSE, 2018
@inproceedings{bib_DWEN_2018, AUTHOR = {AMAR BUDHIRAJA, KARTIK DUTTA, Raghu Babu Reddy Y, Manish Srivastava}, TITLE = {DWEN: deep word embedding network for duplicate bug report detection in software repositories}, BOOKTITLE = {International Conference on Software Engineering}. YEAR = {2018}}
Bug report filing is a major part of software maintenance. Due to extensive number of bugs filed everyday in large software projects and the asynchronous nature of bug report filing ecosystem, duplicate bug reports are filed. Capturing and tagging duplicate bug reports is crucial in order to avoid assignment of the same bug to different developers. Efforts have been made in the past to detect duplicate bug reports by using topic modelling [2], discriminative methods [5], meta-attributes [6], etc. Recently, Yang et al.[8] proposed an approach to combine word embeddings, TF-IDF and meta-attributes to compute bug similarity between two bug reports.
Lwe: Lda refined word embeddings for duplicate bug report detection
AMAR BUDHIRAJA,Raghu Babu Reddy Y,Manish Srivastava
International Conference on Software Engineering, ICSE, 2018
@inproceedings{bib_Lwe:_2018, AUTHOR = {AMAR BUDHIRAJA, Raghu Babu Reddy Y, Manish Srivastava}, TITLE = {Lwe: Lda refined word embeddings for duplicate bug report detection}, BOOKTITLE = {International Conference on Software Engineering}. YEAR = {2018}}
Bug reporting is a major part of software maintenance and, due to its inherently asynchronous nature, duplicate bug reporting has become fairly common. Detecting duplicate bug reports is an important task in order to avoid the assignment of the same bug to different developers. Earlier approaches have improved duplicate bug report detection by using the notions of word embeddings, topic models and other machine learning approaches. In this poster, we attempt to combine Latent Dirichlet Allocation (LDA) and word embeddings to leverage the strengths of both approaches for this task. As a first step towards this idea, we present an initial analysis and an approach which is able to outperform both word embeddings and LDA for this task. We validate our hypothesis on a real-world dataset of the Firefox project and show that there is potential in combining both LDA and word embeddings for duplicate bug report detection.
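One simple way to combine the two signals is a weighted blend; this is a hedged sketch of the general idea, not the paper's actual formulation, and the scores, weight `alpha`, and bug IDs are illustrative:

```python
def combined_score(topic_sim, embed_sim, alpha=0.5):
    """Blend LDA topic similarity and word-embedding similarity;
    alpha tunes how much weight the topic-model signal receives."""
    return alpha * topic_sim + (1 - alpha) * embed_sim

# Candidate duplicates scored as (topic_sim, embed_sim) against a query report.
candidates = {"bug#101": (0.9, 0.4), "bug#202": (0.3, 0.8)}

ranked = sorted(candidates, key=lambda b: combined_score(*candidates[b]), reverse=True)
print(ranked[0])  # the candidate whose blended score is highest
```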
A corpus of English-Hindi code-mixed tweets for sarcasm detection
SAHIL SWAMI,ANKUSH KHANDELWAL,VINAY KUMAR SINGH,SYED SARFARAZ AKHTAR,Manish Srivastava
Technical Report, arXiv, 2018
@inproceedings{bib_A_co_2018, AUTHOR = {SAHIL SWAMI, ANKUSH KHANDELWAL, VINAY KUMAR SINGH, SYED SARFARAZ AKHTAR, Manish Srivastava}, TITLE = {A corpus of English-Hindi code-mixed tweets for sarcasm detection}, BOOKTITLE = {Technical Report}. YEAR = {2018}}
Social media platforms like Twitter and Facebook have become two of the largest media used by people to express their views towards different topics. The generation of such large user data has made NLP tasks like sentiment analysis and opinion mining much more important. Using sarcasm in texts on social media has lately become a popular trend. Sarcasm reverses the meaning and polarity of what is implied by the text, which poses a challenge for many NLP tasks. The task of sarcasm detection in text is gaining more and more importance for both commercial and security services. We present the first English-Hindi code-mixed dataset of tweets marked for the presence of sarcasm and irony, where each token is also annotated with a language tag. We present a baseline supervised classification system developed on the same dataset, which achieves an average F-score of 78.4 using a random forest classifier with 10-fold cross validation.
An English-Hindi code-mixed corpus: Stance annotation and baseline system
SAHIL SWAMI,ANKUSH KHANDELWAL,VINAY KUMAR SINGH,SYED SARFARAZ AKHTAR,Manish Srivastava
Technical Report, arXiv, 2018
@inproceedings{bib_An_E_2018, AUTHOR = {SAHIL SWAMI, ANKUSH KHANDELWAL, VINAY KUMAR SINGH, SYED SARFARAZ AKHTAR, Manish Srivastava}, TITLE = {An English-Hindi code-mixed corpus: Stance annotation and baseline system}, BOOKTITLE = {Technical Report}. YEAR = {2018}}
Social media has become one of the main channels for people to communicate and share their views with society. We can often detect from these views whether a person is in favor of, against, or neutral towards a given topic. These opinions from social media are very useful for various companies. We present a new dataset consisting of 3545 English-Hindi code-mixed tweets with opinions towards the demonetisation implemented in India in 2016, which was followed by a large countrywide debate. We present a baseline supervised classification system for stance detection, developed on the same dataset, that uses various machine learning techniques to achieve an accuracy of 58.7% on 10-fold cross validation.
A dataset of Hindi-English code-mixed social media text for hate speech detection
ADITYA BOHRA,DEEPANSHU VIJAY,VINAY KUMAR SINGH,SYED SARFARAZ AKHTAR,Manish Srivastava
Conference of the North American Chapter of the Association for Computational Linguistics Workshops, NAACL-W, 2018
@inproceedings{bib_A_da_2018, AUTHOR = {ADITYA BOHRA, DEEPANSHU VIJAY, VINAY KUMAR SINGH, SYED SARFARAZ AKHTAR, Manish Srivastava}, TITLE = {A dataset of Hindi-English code-mixed social media text for hate speech detection}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics Workshops}. YEAR = {2018}}
Hate speech detection in social media texts is an important Natural Language Processing task with several crucial applications, such as sentiment analysis, investigating cyberbullying and examining socio-political controversies. While relevant research has been done independently on code-mixed social media texts and on hate speech detection, our work is the first attempt at detecting hate speech in Hindi-English code-mixed social media text. In this paper, we analyze the problem of hate speech detection in code-mixed texts and present a Hindi-English code-mixed dataset consisting of tweets posted on Twitter. The tweets are annotated with the language at word level and the class they belong to (Hate Speech or Normal Speech). We also propose a supervised classification system for detecting hate speech in the text using various character-level, word-level, and lexicon-based features.
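Character-level features of the kind mentioned here are typically character n-grams; a minimal sketch, where the boundary markers and the choice of n are illustrative rather than the paper's exact feature set:

```python
def char_ngrams(token, n=3):
    """Character n-grams with boundary markers, a common surface feature
    for classifying noisy code-mixed text (robust to spelling variation)."""
    padded = f"<{token}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("nafrat"))  # trigrams of the padded token
```

Because variants like "nafrat" and "nafrath" share most of their n-grams, such features remain informative even when spellings drift.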
Emotions are universal: Learning sentiment based representations of resource-poor languages using siamese networks
NURENDRA CHOUDHARY,RAJAT SINGH,ISHITA BINDLISH,Manish Srivastava
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2018
@inproceedings{bib_Emot_2018, AUTHOR = {NURENDRA CHOUDHARY, RAJAT SINGH, ISHITA BINDLISH, Manish Srivastava}, TITLE = {Emotions are universal: Learning sentiment based representations of resource-poor languages using siamese networks}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2018}}
Machine learning approaches in sentiment analysis principally rely on the abundance of resources. To limit this dependence, we propose a novel method called Siamese Network Architecture for Sentiment Analysis (SNASA) to learn representations of resource-poor languages by jointly training them with resource-rich languages using a siamese network. The SNASA model consists of twin Bi-directional Long Short-Term Memory Recurrent Neural Networks (Bi-LSTM RNNs) with shared parameters, joined by a contrastive loss function based on a similarity metric. The model learns the sentence representations of a resource-poor and a resource-rich language in a common sentiment space by using a similarity metric based on their individual sentiments. The model, hence, projects sentences with similar sentiment closer to each other and sentences with different sentiment farther from each other. Experiments on large-scale datasets of resource-rich languages (English and Spanish) and resource-poor languages (Hindi and Telugu) reveal that SNASA outperforms the state-of-the-art sentiment analysis approaches based on distributional semantics, semantic rules, lexicon lists and deep neural network representations without shared parameters.
Universal Dependency Parsing for Hindi-English Code-Switching
IRSHAD AHMAD BHAT,Riyaz Ahmad Bhat,Manish Srivastava,Dipti Mishra Sharma
Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 2018
@inproceedings{bib_Univ_2018, AUTHOR = {IRSHAD AHMAD BHAT, Riyaz Ahmad Bhat, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Universal Dependency Parsing for Hindi-English Code-Switching}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics}. YEAR = {2018}}
Code-switching is a phenomenon of mixing grammatical structures of two or more languages under varied social constraints. Code-switching data differ so radically from the benchmark corpora used in the NLP community that the application of standard technologies to these data degrades their performance sharply. Unlike standard corpora, these data often need to go through additional processes such as language identification, normalization and/or back-transliteration for their efficient processing. In this paper, we investigate these indispensable processes and other problems associated with syntactic parsing of code-switching data and propose methods to mitigate their effects. In particular, we study dependency parsing of code-switching data of Hindi and English multilingual speakers from Twitter. We present a treebank of Hindi-English code-switching tweets under the Universal Dependencies scheme and propose a neural stacking model for parsing that efficiently leverages part-of-speech tag and syntactic tree annotations in the code-switching treebank and the preexisting Hindi and English treebanks. We also present normalization and back-transliteration models with a decoding process tailored for code-switching data. Results show that our neural stacking parser is 1.5% LAS points better than the augmented parsing model, and our decoding process improves results by 3.8% LAS points over the first-best normalization and/or back-transliteration.
A Dataset for Detecting Irony in Hindi-English Code-Mixed Social Media Text.
DEEPANSHU VIJAY,ADITYA BOHRA,VINAY KUMAR SINGH,SYED SARFARAZ AKHTAR,Manish Srivastava
Extended Semantic Web Conference, ESWC, 2018
@inproceedings{bib_A_Da_2018, AUTHOR = {DEEPANSHU VIJAY, ADITYA BOHRA, VINAY KUMAR SINGH, SYED SARFARAZ AKHTAR, Manish Srivastava}, TITLE = {A Dataset for Detecting Irony in Hindi-English Code-Mixed Social Media Text.}, BOOKTITLE = {Extended Semantic Web Conference}. YEAR = {2018}}
Irony is one of many forms of figurative language. Irony detection is crucial for Natural Language Processing (NLP) tasks like sentiment analysis and opinion mining. From a cognitive point of view, it is a challenge to study how humans use irony as a communication tool. While relevant research has been done independently on code-mixed social media texts and on irony detection, our work is the first attempt at detecting irony in Hindi-English code-mixed social media text. In this paper, we study automatic irony detection as a classification problem and present a Hindi-English code-mixed dataset consisting of tweets posted on Twitter. The tweets are annotated with the language at word level and the class they belong to (Ironic or Non-Ironic). We also propose a supervised classification system for detecting irony in the text using various character-level, word-level, and structural features.
Cross-Lingual Task-Specific Representation Learning for Text Classification in Resource Poor Languages
NURENDRA CHOUDHARY,RAJAT SINGH,Manish Srivastava
International Joint Conference on Artificial Intelligence, IJCAI, 2018
@inproceedings{bib_Cros_2018, AUTHOR = {NURENDRA CHOUDHARY, RAJAT SINGH, Manish Srivastava}, TITLE = {Cross-Lingual Task-Specific Representation Learning for Text Classification in Resource Poor Languages}, BOOKTITLE = {International Joint Conference on Artificial Intelligence}. YEAR = {2018}}
Neural network models have shown promising results for text classification. However, these solutions are limited by their dependence on the availability of annotated data. The prospect of leveraging resource-rich languages to enhance the text classification of resource-poor languages is fascinating. The performance on resource-poor languages can significantly improve if the resource availability constraints can be offset. To this end, we present a twin Bidirectional Long Short Term Memory (Bi-LSTM) network with shared parameters consolidated by a contrastive loss function (based on a similarity metric). The model learns the representations of resource-poor and resource-rich sentences in a common space by using the similarity between their assigned annotation tags. Hence, the model projects sentences with similar tags closer and those with different tags farther from each other. We evaluated our model on the classification tasks of sentiment analysis and emoji prediction for resource-poor languages (Hindi and Telugu) and resource-rich languages (English and Spanish). Our model significantly outperforms the state-of-the-art approaches in both tasks across all metrics.
Degree based classification of harmful speech using twitter data
SANJANA SHARMA,SAKSHAM AGRAWAL,Manish Srivastava
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying, TRAC, 2018
@inproceedings{bib_Degr_2018, AUTHOR = {SANJANA SHARMA, SAKSHAM AGRAWAL, Manish Srivastava}, TITLE = {Degree based classification of harmful speech using twitter data}, BOOKTITLE = {Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying}. YEAR = {2018}}
Harmful speech takes various forms and plagues social media in different ways. To crack down on different degrees of hate speech and the abusive behavior within it, classification must rest on well-defined, accountable criteria that go beyond labels such as racist, sexist, or targeting a particular group or community. This paper primarily describes how we created an ontological classification of harmful speech based on degree of hateful intent, and used it to annotate Twitter data accordingly. The key contribution of this paper is the new dataset of tweets we created based on ontological classes and degrees of harmful speech found in the text. We also propose a supervised classification system for recognizing these harmful speech classes in text.
Gender Prediction in English-Hindi Code-Mixed Social Media Content: Corpus and Baseline System
ANKUSH KHANDELWAL,SAHIL SWAMI,SYED SARFARAZ AKHTAR,Manish Srivastava
Computacion y Sistemas, CyS, 2018
@inproceedings{bib_Gend_2018, AUTHOR = {ANKUSH KHANDELWAL, SAHIL SWAMI, SYED SARFARAZ AKHTAR, Manish Srivastava}, TITLE = {Gender Prediction in English-Hindi Code-Mixed Social Media Content: Corpus and Baseline System}, BOOKTITLE = {Computacion y Sistemas}. YEAR = {2018}}
The rapid expansion in the usage of social media networking sites leads to a huge amount of unprocessed user-generated data which can be used for text mining. Author profiling, the problem of automatically determining aspects such as the author's gender and age group from a text, is gaining much popularity in computational linguistics. Most past research in author profiling has concentrated on English texts. However, many users often change language while posting on social media, which is called code-mixing, and this raises challenges for text classification and author profiling, such as spelling variations, non-grammatical structure and transliteration. There are very few English-Hindi code-mixed annotated datasets of social media content available online. In this paper, we analyze the task of author gender prediction in code-mixed content and present a corpus of English-Hindi texts collected from Twitter, annotated with the author's gender. We also explore language identification of every word in this corpus. We present a supervised classification baseline system which uses various machine learning algorithms to identify the gender of an author from a text, based on character- and word-level features.
Humor detection in english-hindi code-mixed social media content: Corpus and baseline system
ANKUSH KHANDELWAL,SAHIL SWAMI,SYED SARFARAZ AKHTAR,Manish Srivastava
International Conference on Language Resources and Evaluation, LREC, 2018
@inproceedings{bib_Humo_2018, AUTHOR = {ANKUSH KHANDELWAL, SAHIL SWAMI, SYED SARFARAZ AKHTAR, Manish Srivastava}, TITLE = {Humor detection in english-hindi code-mixed social media content: Corpus and baseline system}, BOOKTITLE = {International Conference on Language Resources and Evaluation}. YEAR = {2018}}
The tremendous amount of user-generated data on social networking sites has led to the growing popularity of automatic text classification in the field of computational linguistics over the past decade. Within this domain, one problem that has drawn the attention of many researchers is automatic humor detection in texts. In-depth semantic understanding of the text is required to detect humor, which makes the problem difficult to automate. With the increase in the number of social media users, many multilingual speakers often interchange between languages while posting on social media, which is called code-mixing. It introduces some challenges in the field of linguistic analysis of social media content (Barman et al., 2014), like spelling variations and non-grammatical structures in a sentence. Past research includes detecting puns in texts (Kao et al., 2016) and humor in one-liners (Mihalcea et al., 2010) in a single language, but with the tremendous amount of code-mixed data available online, there is a need to develop techniques which detect humor in code-mixed tweets. In this paper, we analyze the task of humor detection in texts and describe a freely available corpus containing English-Hindi code-mixed tweets annotated with humorous (H) or non-humorous (N) tags. We also tagged the words in the tweets with language tags (English/Hindi/Others). Moreover, we describe the experiments carried out on the corpus and provide a baseline classification system which distinguishes between humorous and non-humorous texts.
Automatic question generation using relative pronouns and adverbs
PAYAL KULLAR,Konigari Rachna,MUKUL NITIN HASE,Manish Srivastava
Student Research Workshop, SRW, 2018
@inproceedings{bib_Auto_2018, AUTHOR = {PAYAL KULLAR, Konigari Rachna, MUKUL NITIN HASE, Manish Srivastava}, TITLE = {Automatic question generation using relative pronouns and adverbs}, BOOKTITLE = {Student Research Workshop}. YEAR = {2018}}
This paper presents a system that automatically generates multiple, natural language questions using relative pronouns and relative adverbs from complex English sentences. Our system is syntax-based, runs on dependency parse information of a single-sentence input, and achieves high accuracy in terms of syntactic correctness, semantic adequacy, fluency and uniqueness. One of the key advantages of our system, in comparison with other rule-based approaches, is that we nearly eliminate the chances of getting a wrong wh-word in the generated question, by fetching the requisite wh-word from the input sentence itself. Depending upon the input, we generate both factoid and descriptive type questions. To the best of our knowledge, the exploitation of wh-pronouns and wh-adverbs to generate questions is novel in the Automatic Question Generation task.
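The key trick — lifting the wh-word out of the relative clause itself — can be sketched with a rough regex heuristic. This is far cruder than the dependency-parse rules the paper uses, and `question_from_relative_clause` is a hypothetical helper for illustration only:

```python
import re

def question_from_relative_clause(sentence):
    """If the sentence contains a wh-relative clause, reuse its wh-word
    as the question word and emit the clause as a question."""
    m = re.search(r",?\s+(who|which|where|when|whose)\s+([^,.]+)", sentence, re.I)
    if not m:
        return None
    wh, rest = m.group(1), m.group(2).strip()
    return f"{wh.capitalize()} {rest}?"

print(question_from_relative_clause(
    "Einstein, who developed the theory of relativity, won the Nobel Prize."))
# → Who developed the theory of relativity?
```

Because the wh-word is copied from the input rather than predicted, this style of rule cannot choose a wrong question word, which mirrors the advantage the paper claims.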
Named entity recognition for hindi-english code-mixed social media text
VINAY KUMAR SINGH,DEEPANSHU VIJAY,SYED SARFARAZ AKHTAR,Manish Srivastava
Seventh Named Entities Workshop, ACL-NEWS, 2018
@inproceedings{bib_Name_2018, AUTHOR = {VINAY KUMAR SINGH, DEEPANSHU VIJAY, SYED SARFARAZ AKHTAR, Manish Srivastava}, TITLE = {Named entity recognition for hindi-english code-mixed social media text}, BOOKTITLE = {Seventh Named Entities Workshop}. YEAR = {2018}}
Named Entity Recognition (NER) is a major task in the field of Natural Language Processing (NLP), and is also a sub-task of Information Extraction. The challenge of NER for tweets lies in the insufficient information available in a tweet. There has been a significant amount of work done on entity extraction, but only for resource-rich languages and domains such as newswire. Entity extraction is, in general, a challenging task for such informal text, and code-mixed text further complicates the process with its unstructured and incomplete information. We propose experiments with different machine learning classification algorithms with word, character and lexical features. The algorithms we experimented with are Decision Tree, Long Short-Term Memory (LSTM), and Conditional Random Field (CRF). In this paper, we present a corpus for NER in Hindi-English code-mixed text along with extensive experiments on our machine learning models, which achieved the best F1-score of 0.95 with both CRF and LSTM.
Transliteration better than translation? answering code-mixed questions over a knowledge base
VISHAL GUPTA,Manoj Chinnakotla,Manish Srivastava
Computational Approaches to Linguistic Code-Switching, CALCS, 2018
@inproceedings{bib_Tran_2018, AUTHOR = {VISHAL GUPTA, Manoj Chinnakotla, Manish Srivastava}, TITLE = {Transliteration better than translation? answering code-mixed questions over a knowledge base}, BOOKTITLE = {Computational Approaches to Linguistic Code-Switching}. YEAR = {2018}}
Humans can learn multiple languages. If they know a fact in one language, they can answer a question in another language they understand. They can also answer Code-mix (CM) questions: questions which contain both languages. This behavior is attributed to the unique learning ability of humans. Our task aims to study if machines can achieve this. We demonstrate how effectively a machine can answer CM questions. In this work, we adopt a two phase approach: candidate generation and candidate re-ranking to answer questions. We propose a Triplet-Siamese-Hybrid CNN (TSHCNN) to re-rank candidate answers. We show experiments on the SimpleQuestions dataset. Our network is trained only on English questions provided in this dataset and noisy Hindi translations of these questions and can answer English-Hindi CM questions effectively without the need of translation into English. Back-transliterated CM questions outperform their lexical and sentence level translated counterparts by 5% & 35% in accuracy respectively, highlighting the efficacy of our approach in a resource constrained setting.
Transzaar: Empowers Human Translators
RASHID AHMAD,PRIYANK GUPTA,Nagaraju Vuppala,Sanket Kumar Pathak, Ashutosh Kumar,Gagan Soni,Sravan Kumar,Manish Srivastava,Avinash K Singh
International Conference on Computational Science and Applications, ICCSA, 2018
@inproceedings{bib_Tran_2018, AUTHOR = {RASHID AHMAD, PRIYANK GUPTA, Nagaraju Vuppala, Sanket Kumar Pathak, Ashutosh Kumar, Gagan Soni, Sravan Kumar, Manish Srivastava, Avinash K Singh}, TITLE = {Transzaar: Empowers Human Translators}, BOOKTITLE = {International Conference on Computational Science and Applications}. YEAR = {2018}}
In this paper, we describe Transzaar, an AI-powered tool that offers computer-aided translation (CAT) functionality: pre-translation analysis, post-editing of machine-translated content, translation prediction, text alignment, extensive logging and integration with several machine translation (MT) systems. Transzaar aids a human translator in performing various language processing tasks, viz., translation, transliteration, localization, and other kinds of text analysis. Using Transzaar, human translators can post-edit machine-translated content to improve its fluency and accuracy so that it matches the naturalness of human translation, while delivering better turn-around time. By aiding the post-editing process, Transzaar increases the productivity of human translators by 2-3 fold within a couple of months of usage for certain language pairs. It continuously collects feedback, which helps the MT system learn and improve periodically from the newly generated data.
Enabling code-mixed translation: Parallel corpus creation and MT augmentation approach
MRINAL DHAR,VAIBHAV KUMAR,Manish Srivastava
Workshop on Linguistic Resources for Natural Language Processing, LR4NLP-W, 2018
@inproceedings{bib_Enab_2018, AUTHOR = {MRINAL DHAR, VAIBHAV KUMAR, Manish Srivastava}, TITLE = {Enabling code-mixed translation: Parallel corpus creation and MT augmentation approach}, BOOKTITLE = {Workshop on Linguistic Resources for Natural Language Processing}. YEAR = {2018}}
Code-mixing, the use of two or more languages in a single sentence, is ubiquitous, generated by multilingual speakers across the world. The phenomenon presents itself prominently in social media discourse. Consequently, there is a growing need for translating code-mixed hybrid language into standard languages. However, due to the lack of gold parallel data, existing machine translation systems fail to properly translate code-mixed text. In an effort to initiate the task of machine translation of code-mixed content, we present a newly created parallel corpus of code-mixed English-Hindi and English. We selected previously available English-Hindi code-mixed data as a starting point for the creation of our parallel corpus. We then chose 4 human translators, fluent in both English and Hindi, to translate the 6088 code-mixed English-Hindi sentences to English. With the help of the created parallel corpus, we analyzed the structure of English-Hindi code-mixed data and present a technique to augment run-of-the-mill machine translation (MT) approaches that can help achieve superior translations without the need for specially designed translation systems. We present an augmentation pipeline for existing MT approaches, like Phrase-Based MT (Moses) and Neural MT, to improve the translation of code-mixed text. The augmentation pipeline is presented as a pre-processing step and can be plugged into any existing MT system, which we demonstrate by improving translations done by systems like Moses, Google Neural Machine Translation System (NMTS) and Bing Translator for English-Hindi code-mixed content.
Gold Corpus for Telegraphic Summarization
MALIREDDY CHANAKYA,SRIVENKATA N MOUNIKA SOMISETTY,Manish Srivastava
Workshop on Linguistic Resources for Natural Language Processing, LR4NLP-W, 2018
@inproceedings{bib_Gold_2018, AUTHOR = {MALIREDDY CHANAKYA, SRIVENKATA N MOUNIKA SOMISETTY, Manish Srivastava}, TITLE = {Gold Corpus for Telegraphic Summarization}, BOOKTITLE = {Workshop on Linguistic Resources for Natural Language Processing}. YEAR = {2018}}
Most extractive summarization techniques operate by ranking all the source sentences and then selecting the top-ranked sentences as the summary. Such methods are known to produce good summaries, especially when applied to news articles and scientific texts. However, they do not fare so well when applied to texts such as fictional narratives, which do not have a single central or recurrent theme. This is because the information or plot of the story is usually spread across several sentences. In this paper, we discuss a different summarization technique called Telegraphic Summarization. Here, we do not select whole sentences; rather, we pick short segments of text spread across sentences as the summary. We have tailored a set of guidelines to create such summaries and, using the same, annotate a gold corpus of 200 English short stories.
Twitter corpus of resource-scarce languages for sentiment analysis and multilingual emoji prediction
NURENDRA CHOUDHARY,RAJAT SINGH,V ANVESH RAO,Manish Srivastava
International Conference on Computational Linguistics, COLING, 2018
@inproceedings{bib_Twit_2018, AUTHOR = {NURENDRA CHOUDHARY, RAJAT SINGH, V ANVESH RAO, Manish Srivastava}, TITLE = {Twitter corpus of resource-scarce languages for sentiment analysis and multilingual emoji prediction}, BOOKTITLE = {International Conference on Computational Linguistics}. YEAR = {2018}}
In this paper, we leverage social media platforms such as Twitter for developing corpora across multiple languages. The corpus creation methodology is applicable to resource-scarce languages provided the speakers of that particular language are active users on social media platforms. We present an approach to extract social media microblogs such as tweets (Twitter). In this paper, we create corpora for multilingual sentiment analysis and emoji prediction in Hindi, Bengali and Telugu. Further, we perform and analyze multiple NLP tasks utilizing the corpus and report interesting observations.
Towards word embeddings for improved duplicate bug report retrieval in software repositories
AMAR BUDHIRAJA,KARTIK DUTTA,Manish Srivastava,Raghu Babu Reddy Y
International Conference on Theory of Information Retrieval, ICTIR, 2018
@inproceedings{bib_Towa_2018, AUTHOR = {AMAR BUDHIRAJA, KARTIK DUTTA, Manish Srivastava, Raghu Babu Reddy Y}, TITLE = {Towards word embeddings for improved duplicate bug report retrieval in software repositories}, BOOKTITLE = {International Conference on Theory of Information Retrieval}. YEAR = {2018}}
A key part of software maintenance is bug reporting and rectification. Bug reporting is a major issue, and due to its asynchronous nature, duplicate bug reporting is common. Detecting duplicate bug reports is an important task in software maintenance in order to avoid the assignment of the same bug to different developers. In this paper, we explore the notion of using word embeddings for retrieving duplicate bug reports in large software repositories. We discuss an approach to model each bug report as a dense vector and retrieve its top-k most similar reports for duplicate bug report detection. Through experiments on two real-world datasets, we show that word embeddings perform better than baselines and related approaches and have the potential to improve duplicate bug report retrieval.
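The retrieval step this abstract describes, representing each report as a dense vector and ranking stored reports by similarity, can be sketched in a few lines. This is a minimal illustration with a toy embedding table and hypothetical report IDs, not the paper's trained model:

```python
import numpy as np

def report_vector(tokens, embeddings, dim=4):
    """Average the word vectors of a bug report's tokens (zeros if all OOV)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def top_k_similar(query_vec, corpus_vecs, k=2):
    """Rank stored reports by cosine similarity to the query report."""
    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0
    scores = [(rid, cosine(query_vec, v)) for rid, v in corpus_vecs.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

# Toy embedding table standing in for vectors trained on a bug-report corpus.
emb = {
    "crash":  np.array([1.0, 0.1, 0.0, 0.0]),
    "login":  np.array([0.0, 1.0, 0.2, 0.0]),
    "signin": np.array([0.0, 0.9, 0.3, 0.0]),
    "ui":     np.array([0.0, 0.0, 0.0, 1.0]),
}
corpus = {
    101: report_vector(["crash", "login"], emb),
    102: report_vector(["ui"], emb),
}
query = report_vector(["crash", "signin"], emb)
ranked = top_k_similar(query, corpus, k=2)   # report 101 should rank first
```

The same ranking scheme works with any dense document representation; only `report_vector` changes.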
NUTS: Network for Unsupervised Telegraphic Summarization
Chanakya Malireddy,Tirth Maniar,Sajal Maheshwari,Manish Srivastava
International Conference on Learning Representations, ICLR, 2018
@inproceedings{bib_NUTS_2018, AUTHOR = {Chanakya Malireddy, Tirth Maniar, Sajal Maheshwari, Manish Srivastava}, TITLE = {NUTS: Network for Unsupervised Telegraphic Summarization}, BOOKTITLE = {International Conference on Learning Representations}. YEAR = {2018}}
Extractive summarization methods operate by ranking and selecting the sentences which best encapsulate the theme of a given document. They do not fare well in domains like fictional narratives, where there is no central theme and core information is not encapsulated by a small set of sentences. For the purpose of reducing the size of the document while conveying the idea expressed by each sentence, we need more sentence-specific methods. Telegraphic summarization, which selects short segments across several sentences, is better suited for such domains. Telegraphic summarization captures the plot better by retaining shorter versions of each sentence while not really concerning itself with grammatically linking these segments. In this paper, we propose an unsupervised deep learning network (NUTS) to generate telegraphic summaries. We use multiple encoder-decoder networks and learn to drop portions of the text that are inferable from the chosen segments. The model is agnostic to both sentence length and style. We demonstrate that the summaries produced by our model show significant quantitative and qualitative improvement over those produced by existing methods and baselines.
Aggression detection on social media text using deep neural networks
VINAY KUMAR SINGH,AMAN VARSHNEY,SYED SARFARAZ AKHTAR,DEEPANSHU VIJAY,Manish Srivastava
workshop on abusive language online, ALW, 2018
@inproceedings{bib_Aggr_2018, AUTHOR = {VINAY KUMAR SINGH, AMAN VARSHNEY, SYED SARFARAZ AKHTAR, DEEPANSHU VIJAY, Manish Srivastava}, TITLE = {Aggression detection on social media text using deep neural networks}, BOOKTITLE = {workshop on abusive language online}. YEAR = {2018}}
In the past few years, bullying and aggressive posts on social media have grown significantly, causing serious consequences for victims/users of all demographics. The majority of the work in this field has been done for English only. In this paper, we introduce a deep learning based classification system for Facebook posts and comments of Hindi-English code-mixed text to detect aggressive behaviour of/towards users. Our work focuses on text from users majorly in the Indian subcontinent. The dataset that we used for our models is provided by TRAC-1 in their shared task. Our classification model assigns each Facebook post/comment to one of the three predefined categories: "Overtly Aggressive", "Covertly Aggressive" and "Non-Aggressive". We experimented with 6 classification models, and our CNN model on 10-fold cross-validation gave the best result, with a prediction accuracy of 73.2%.
Siamese lstm with convolutional similarity for similar question retrieval
AVINASH KAMINENI,HARISH YENALA,Manish Srivastava,Manoj Chinnakotla
International Joint Symposium on Artificial Intelligence and Natural Language Processing, iSAI-NLP, 2018
@inproceedings{bib_Siam_2018, AUTHOR = {AVINASH KAMINENI, HARISH YENALA, Manish Srivastava, Manoj Chinnakotla}, TITLE = {Siamese lstm with convolutional similarity for similar question retrieval}, BOOKTITLE = {International Joint Symposium on Artificial Intelligence and Natural Language Processing}. YEAR = {2018}}
In this paper, we model the similar question retrieval task as a binary classification problem. We propose a novel approach, "1D-Siamese LSTM for cQA (1D-SLcQA)", to find the semantic similarity between a new question and existing question(s). In 1D-SLcQA, we use a combination of twin LSTM networks and a contrastive loss function to effectively memorize long-term dependencies, i.e., capture semantic similarity even when the length of the answers/questions is very large (200 words). The similarity of the questions is modeled using a single network with 1D (feature) convolution between feature vectors learned from the twin LSTM layers. Experiments on the large-scale real-world Yahoo Answers dataset show that 1D-SLcQA outperforms the state-of-the-art Siamese cQA approach (SCQA).
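The contrastive loss used with twin networks has a standard form that fits in one function. This is a generic sketch, assuming the common convention of y = 1 for similar pairs and a unit margin; the paper's exact formulation may differ:

```python
def contrastive_loss(d, y, margin=1.0):
    """Contrastive loss over a pair distance d:
    pulls similar pairs (y=1) toward distance 0, and pushes dissimilar
    pairs (y=0) until they are at least `margin` apart, after which
    they contribute no gradient."""
    return y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2

# A similar pair at distance 0 costs nothing; a dissimilar pair at
# distance 0 pays the full squared margin.
loss_similar_close    = contrastive_loss(0.0, 1)
loss_dissimilar_close = contrastive_loss(0.0, 0)
loss_dissimilar_far   = contrastive_loss(2.0, 0)
```

In 1D-SLcQA, `d` would come from comparing the two LSTM encodings (via the 1D convolutional similarity), with the loss driving the twin networks to separate non-duplicate questions.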
BoWLer: A neural approach to extractive text summarization.
Pranav Dharkas,Manish Srivastava
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2018
@inproceedings{bib_BoWL_2018, AUTHOR = {Pranav Dharkas, Manish Srivastava}, TITLE = {BoWLer: A neural approach to extractive text summarization.}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}. YEAR = {2018}}
While extractive summarization is a well-studied problem, it is far from solved. In recent years, a large number of interesting and complex models have been used to achieve significant improvements in performance. This can easily be attributed to deep learning models and dense vector representations, but the performance gain comes at the cost of computational and representational complexity. In this work, we present a simple, yet effective approach for extractive summarization of news articles. In line with many recent works in this area, we propose an encoder-decoder architecture with a simple bag-of-words encoder for sentences followed by an attention-based decoder for relevant sentence selection. Our model is trained end-to-end and its performance is comparable to the state-of-the-art models while being simpler both in terms of the number of parameters (significantly fewer) and representational complexity.
Too Many Questions? What Can We Do?: Multiple Question Span Detection.
DANDA PRATHYUSHA,BRIJ MOHAN LAL SRIVASTAVA,Manish Srivastava
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2018
@inproceedings{bib_Too__2018, AUTHOR = {DANDA PRATHYUSHA, BRIJ MOHAN LAL SRIVASTAVA, Manish Srivastava}, TITLE = {Too Many Questions? What Can We Do?: Multiple Question Span Detection.}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}. YEAR = {2018}}
When a human interacts with an information retrieval chat bot, he/she can ask multiple questions at the same time. Current question answering systems can’t handle this scenario effectively. In this paper we propose an approach to identify question spans in a given utterance, by posing this as a sequence labeling problem. The model is trained and evaluated over 4 different freely available datasets. To get a comprehensive coverage of the compound question scenarios, we also synthesize a dataset based on the natural question combination patterns. We exhibit improvement in the performance of the DrQA system when it encounters compound questions which suggests that this approach is vital for real-time human-chatbot interaction.
“Is This A Joke?”: A Large Humor Classification Dataset
Faraz Faruqi,Manish Srivastava
International Conference on Natural Language Processing., ICON, 2018
@inproceedings{bib_“I_2018, AUTHOR = {Faraz Faruqi, Manish Srivastava}, TITLE = {“Is This A Joke?”: A Large Humor Classification Dataset}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2018}}
Humor is an essential characteristic of language. It has been a topic of research in linguistics and philosophy from historical times. In computer science, computational humor, as a part of Natural Language Processing, is a growing area of research. Social Media is rapidly growing as a platform for communication but processing of social media, owing to its semantic perplexity, is still a challenge. These two facts lead us to present a novel dataset for humor classification which captures diversity in humor on web resources. The large size of this dataset is to meet the data requirements for modern machine learning algorithms. This paper also deals with creating a model for detecting and analyzing humor in social media text extracted from eclectic sources on the Internet.
Exploring Chunk Based Templates for Generating a subset of English Text
NIKHILESH BHATNAGAR,Manish Srivastava,Radhika Mamidi
Student Research Workshop, SRW, 2018
@inproceedings{bib_Expl_2018, AUTHOR = {NIKHILESH BHATNAGAR, Manish Srivastava, Radhika Mamidi}, TITLE = {Exploring Chunk Based Templates for Generating a subset of English Text}, BOOKTITLE = {Student Research Workshop}. YEAR = {2018}}
Natural Language Generation (NLG) is a research task which addresses the automatic generation of natural language text representative of an input non-linguistic collection of knowledge. In this paper, we address the task of the generation of grammatical sentences in an isolated context given a partial bag-of-words which the generated sentence must contain. We view the task as a search problem (a problem of choice) involving combinations of smaller chunk based templates extracted from a training corpus to construct a complete sentence. To achieve that, we propose a fitness function which we use in conjunction with an evolutionary algorithm as the search procedure to arrive at a potentially grammatical sentence (modeled by the fitness score) which satisfies the input constraints.
SWDE : A Sub-Word And Document Embedding Based Engine for Clickbait Detection
VAIBHAV KUMAR,DHRUV KHATTAR,MRINAL DHAR,Yash Kumar Lal,Abhimanshu Mishra,Vasudeva Varma Kalidindi,Manish Srivastava
International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, 2018
@inproceedings{bib_SWDE_2018, AUTHOR = {VAIBHAV KUMAR, DHRUV KHATTAR, MRINAL DHAR, Yash Kumar Lal, Abhimanshu Mishra, Vasudeva Varma Kalidindi, Manish Srivastava}, TITLE = {SWDE : A Sub-Word And Document Embedding Based Engine for Clickbait Detection}, BOOKTITLE = {International ACM SIGIR Conference on Research and Development in Information Retrieval}. YEAR = {2018}}
In order to expand their reach and increase website ad revenue, media outlets have started using clickbait techniques to lure readers to click on articles on their digital platforms. Once the user has been enticed to open it, the article fails to satiate their curiosity, serving only to boost click-through rates. Initial methods for this task depended on feature engineering, which varies with each dataset. Industry systems have relied on an exhaustive set of rules to get the job done. Neural networks have barely been explored for this task. We propose a novel approach considering different textual embeddings of a news headline and the related article. We generate sub-word level embeddings of the title using Convolutional Neural Networks and use them to train a bidirectional LSTM architecture. An attention layer allows for calculation of the significance of each term towards the nature of the post. We also generate Doc2Vec embeddings of the title and article text and model how they interact, after which this is concatenated with the output of the previous component. Finally, this representation is passed through a neural network to obtain a score for the headline. We test our model on 2538 posts (having trained it on 17000 records) and achieve an accuracy of 83.49%, outscoring previous state-of-the-art approaches.
Classifying Tweets using Character and Word Level Features
ANKUSH KHANDELWAL,SAHIL SWAMI,SYED SARFARAZ AKHTAR,Manish Shrivastava
Evaluation of Human Language Technologies for Iberian Languages Workshop, IberEval, 2017
@inproceedings{bib_Clas_2017, AUTHOR = {ANKUSH KHANDELWAL, SAHIL SWAMI, SYED SARFARAZ AKHTAR, Manish Shrivastava}, TITLE = {Classifying Tweets using Character and Word Level Features}, BOOKTITLE = {Evaluation of Human Language Technologies for Iberian Languages Workshop}. YEAR = {2017}}
This paper describes the International Institute of Information Technology, Hyderabad's submission to the task Classification Of Spanish Election Tweets (COSET) as a part of IBEREVAL-2017 [1]. The task is to classify Spanish election tweets into political, policy, personal, campaign and other issues. Our system uses Support Vector Machines with a radial basis function kernel to classify tweets. We dwell upon character and word level features along with word embeddings, train the classification model with them, and present the results. Our best run achieves an F1-macro score of 0.6054 on the test corpus for the first phase and 0.8509 for the second phase.
Improve performance of machine translation service using memcached
PRIYANK GUPTA,RASHID AHMAD,Manish Srivastava,Pawan Kumar,Mukul K Sinha
International Conference on Computational Science and Applications, ICCSA, 2017
@inproceedings{bib_Impr_2017, AUTHOR = {PRIYANK GUPTA, RASHID AHMAD, Manish Srivastava, Pawan Kumar, Mukul K Sinha}, TITLE = {Improve performance of machine translation service using memcached}, BOOKTITLE = {International Conference on Computational Science and Applications}. YEAR = {2017}}
Sampark is a machine translation system providing translations among nine pairs of Indian languages. Machine translation systems are a class of natural language processing applications that are far more complex and highly compute-intensive in nature. As the load on the deployed system increases, optimization becomes a challenge. Caching is one of the available options to improve the performance of a software system under increasing load, provided the system exhibits locality of reference; the Sampark MT system, being a natural language processing application, exhibits this characteristic. Memcached is a well-known, simple, in-memory caching solution that has been applied to improve the performance of several distributed web applications in the past. This paper describes how memcached has been applied to improve the performance of the Sampark machine translation service, which is deployed on a large cluster of machines. By applying distributed caching to the MT system, its performance has improved by up to 40%.
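The caching pattern described here, checking a memcached-style key-value store before invoking the expensive translation pipeline, can be sketched as follows. A plain dict stands in for the memcached client (a real deployment would use a client such as pymemcache with the same get/set calls), and `CachedTranslator` and `slow_mt` are illustrative names, not part of Sampark:

```python
import hashlib

class CachedTranslator:
    """Wraps a translation backend with a memcached-style get/set cache."""

    def __init__(self, translate_fn):
        self.translate_fn = translate_fn
        self.cache = {}   # stand-in for a memcached client
        self.hits = 0

    def _key(self, text, src, tgt):
        # Hash the sentence plus language pair into a fixed-length cache key.
        return hashlib.md5(f"{src}:{tgt}:{text}".encode("utf-8")).hexdigest()

    def translate(self, text, src, tgt):
        key = self._key(text, src, tgt)
        cached = self.cache.get(key)          # memcached `get`
        if cached is not None:
            self.hits += 1
            return cached
        result = self.translate_fn(text, src, tgt)
        self.cache[key] = result              # memcached `set` (usually with a TTL)
        return result

# Toy backend standing in for the full MT pipeline; `calls` records cache misses.
calls = []
def slow_mt(text, src, tgt):
    calls.append(text)
    return f"<{tgt}:{text}>"

mt = CachedTranslator(slow_mt)
first = mt.translate("hello world", "eng", "hin")
second = mt.translate("hello world", "eng", "hin")   # served from cache
```

Because translation requests exhibit locality of reference, repeated sentences skip the compute-intensive pipeline entirely, which is the source of the reported speedup.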
LTRC IIITH at IBEREVAL 2017: Stance and Gender Detection in Tweets on Catalan Independence
SAHIL SWAMI,ANKUSH KHANDELWAL,Manish Srivastava,SYED SARFARAZ AKHTAR
Evaluation of Human Language Technologies for Iberian Languages Workshop, IberEval, 2017
@inproceedings{bib_LTRC_2017, AUTHOR = {SAHIL SWAMI, ANKUSH KHANDELWAL, Manish Srivastava, SYED SARFARAZ AKHTAR}, TITLE = {LTRC IIITH at IBEREVAL 2017: Stance and Gender Detection in Tweets on Catalan Independence}, BOOKTITLE = {Evaluation of Human Language Technologies for Iberian Languages Workshop}. YEAR = {2017}}
We describe the system submitted to IBEREVAL-2017 for stance and gender detection in tweets on Catalan Independence [1]. We developed a supervised system using Support Vector Machines with a radial basis function kernel to identify the stance and gender of the tweeter using various character level and word level features. Our system achieves a macro-average of F-score (FAVOR) and F-score (AGAINST) of 0.46 for stance detection in both Spanish and Catalan, and an accuracy of 64.85% and 44.59% for gender detection in Spanish and Catalan respectively.
Relevance Scoring of Triples Using Ordinal Logistic Classification
NAUSHEEN FATMA,Manoj K. Chinnakotla,Manish Srivastava
International conference on Web search and Data Mining, WSDM, 2017
@inproceedings{bib_Rele_2017, AUTHOR = {NAUSHEEN FATMA, Manoj K. Chinnakotla, Manish Srivastava}, TITLE = {Relevance Scoring of Triples Using Ordinal Logistic Classification}, BOOKTITLE = {International conference on Web search and Data Mining}. YEAR = {2017}}
In this paper, we report our participation in Task 2: Triple Scoring of the WSDM Cup 2017 challenge. In this task, we were provided with triples of "type-like" relations which were given human-annotated relevance scores ranging from 0 to 7, with 7 being the "most relevant" and 0 being the "least relevant". The task focuses on two such relations: profession and nationality. We built a system which could automatically predict the relevance scores for unseen triples. Our model is primarily a supervised machine learning one in which we use well-designed features to build a Logistic Ordinal Regression based classification model. The proposed system achieves an overall accuracy score of 0.73 and a Kendall's tau score of 0.36.
The Unusual Suspects: Deep Learning Based Mining of Interesting Entity Trivia from Knowledge Graphs
NAUSHEEN FATMA,Manoj K. Chinnakotla,Manish Srivastava
AAAI Conference on Artificial Intelligence, AAAI, 2017
@inproceedings{bib_The__2017, AUTHOR = {NAUSHEEN FATMA, Manoj K. Chinnakotla, Manish Srivastava}, TITLE = {The Unusual Suspects: Deep Learning Based Mining of Interesting Entity Trivia from Knowledge Graphs}, BOOKTITLE = {AAAI Conference on Artificial Intelligence}. YEAR = {2017}}
Trivia is any fact about an entity which is interesting due to its unusualness, uniqueness or unexpectedness. Trivia could be successfully employed to promote user engagement in various product experiences featuring the given entity. A Knowledge Graph (KG) is a semantic network which encodes various facts about entities and their relationships. In this paper, we propose a novel approach called DBpedia Trivia Miner (DTM) to automatically mine trivia for entities of a given domain in KGs. The essence of DTM lies in learning an Interestingness Model (IM), for a given domain, from human annotated training data provided in the form of interesting facts from the KG. The IM thus learnt is applied to extract trivia for other entities of the same domain in the KG. We propose two different approaches for learning the IM - a) A Convolutional Neural Network (CNN) based approach and b) Fusion Based CNN (F-CNN) approach which combines both hand-crafted and CNN features. Experiments across two different domains - Bollywood Actors and Music Artists reveal that CNN automatically learns features which are relevant to the task and shows competitive performance relative to hand-crafted feature based baselines whereas FCNN significantly improves the performance over the baseline approaches which use hand-crafted features alone. Overall, DTM achieves an F1 score of 0.81 and 0.65 in Bollywood Actors and Music Artists domains respectively.
Word Similarity Datasets for Indian Languages: Annotation and Baseline Systems
SYED SARFARAZ AKHTAR,ARIHANT GUPTA,AVIJIT VAJPAYEE,ARJIT SRIVASTAVA,Manish Srivastava
Linguistic Annotation Workshop, LAW, 2017
@inproceedings{bib_Word_2017, AUTHOR = {SYED SARFARAZ AKHTAR, ARIHANT GUPTA, AVIJIT VAJPAYEE, ARJIT SRIVASTAVA, Manish Srivastava}, TITLE = {Word Similarity Datasets for Indian Languages: Annotation and Baseline Systems}, BOOKTITLE = {Linguistic Annotation Workshop}. YEAR = {2017}}
With the advent of word representations, word similarity tasks are becoming increasingly popular as an evaluation metric for the quality of the representations. In this paper, we present manually annotated monolingual word similarity datasets of six Indian languages – Urdu, Telugu, Marathi, Punjabi, Tamil and Gujarati. These are among the most spoken Indian languages worldwide after Hindi and Bengali. For the construction of these datasets, our approach relies on translation and re-annotation of word similarity datasets of English. We also present baseline scores for word representation models using state-of-the-art techniques for Urdu, Telugu and Marathi by evaluating them on the newly created word similarity datasets.
Joining Hands: Exploiting Monolingual Treebanks for Parsing of Code-mixing Data
IRSHAD AHMAD BHAT,RIYAZ AHMAD BHAT,Manish Srivastava,Dipti Mishra Sharma
Conference of the European Chapter of the Association for Computational Linguistics (EACL), EACL, 2017
@inproceedings{bib_Join_2017, AUTHOR = {IRSHAD AHMAD BHAT, RIYAZ AHMAD BHAT, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Joining Hands: Exploiting Monolingual Treebanks for Parsing of Code-mixing Data}, BOOKTITLE = {Conference of the European Chapter of the Association for Computational Linguistics (EACL)}. YEAR = {2017}}
In this paper, we propose efficient and less resource-intensive strategies for parsing of code-mixed data. These strategies are not constrained by in-domain annotations, rather they leverage pre-existing monolingual annotated resources for training. We show that these methods can produce significantly better results as compared to an informed baseline. Due to lack of an evaluation set for code-mixed structures, we also present a data set of 450 Hindi and English code-mixed tweets of Hindi multilingual speakers for evaluation.
Significance of neural phonotactic models for large-scale spoken language identification
BRIJ MOHAN LAL SRIVASTAVA,VYDANA HARI KRISHNA,Anil Kumar Vuppala,Manish Srivastava
International Joint Conference on Neural Networks, IJCNN, 2017
@inproceedings{bib_Sign_2017, AUTHOR = {BRIJ MOHAN LAL SRIVASTAVA, VYDANA HARI KRISHNA, Anil Kumar Vuppala, Manish Srivastava}, TITLE = {Significance of neural phonotactic models for large-scale spoken language identification}, BOOKTITLE = {International Joint Conference on Neural Networks}. YEAR = {2017}}
Language identification (LID) is a vital front-end for spoken dialogue systems operating in diverse linguistic settings, reducing recognition and understanding errors. Existing LID systems which use low-level signal information for classification do not scale well due to the exponential growth of parameters as the classes increase. They also suffer performance degradation due to the inherent variabilities of the speech signal. In the proposed approach, we model the language-specific phonotactic information in speech using a recurrent neural network to develop an LID system. The input speech signal is tokenized into phone sequences by a common language-independent phone recognizer with varying phonetic coverage. We establish a causal relationship between phonetic coverage and LID performance. The phonotactics in the observed phone sequences are modeled using statistical and recurrent neural network language models to predict language-specific symbols from a universal phonetic inventory. The proposed approach is robust, computationally lightweight and highly scalable. Experiments show that the convex combination of statistical and recurrent neural network language model (RNNLM) based phonotactic models significantly outperforms a strong Deep Neural Network (DNN) baseline, which is shown to surpass the performance of the i-vector based approach for LID. The proposed approach outperforms the baseline models in terms of mean F1 score over 176 languages. Further, we provide significant information-theoretic evidence to analyze the mechanism of the proposed approach.
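The "convex combination" of the two phonotactic models amounts to a per-language linear interpolation of their scores. A minimal sketch, with toy posteriors over three hypothetical languages and an assumed mixing weight (the paper's actual weight and score scale may differ):

```python
import numpy as np

def interpolate_lms(p_stat, p_rnn, lam=0.5):
    """Convex combination of two language-model score distributions:
    lam * statistical + (1 - lam) * RNNLM, renormalized per language."""
    p_stat, p_rnn = np.asarray(p_stat, float), np.asarray(p_rnn, float)
    mix = lam * p_stat + (1 - lam) * p_rnn
    return mix / mix.sum()

# Toy per-language posteriors from two hypothetical phonotactic models.
stat = [0.6, 0.3, 0.1]   # statistical n-gram LM over 3 languages
rnn  = [0.2, 0.7, 0.1]   # RNNLM over the same 3 languages
mixed = interpolate_lms(stat, rnn, lam=0.4)
best_language = int(np.argmax(mixed))   # language chosen by the combined model
```

The interpolation lets the combined system pick a language the two models disagree on, which is where the reported gain over either model alone comes from.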
Sentiment analysis using relative prosody features
HARIKA ABBURI,KNRK RAJU ALLURI,Anil Kumar Vuppala,Manish Srivastava,Suryakanth Gangashetty
International Conference on Contemporary Computing, IC3, 2017
@inproceedings{bib_Sent_2017, AUTHOR = {HARIKA ABBURI, KNRK RAJU ALLURI, Anil Kumar Vuppala, Manish Srivastava, Suryakanth Gangashetty}, TITLE = {Sentiment analysis using relative prosody features}, BOOKTITLE = {International Conference on Contemporary Computing}, YEAR = {2017}}
The recent growth in usage of digital media has led people to share their opinions about specific entities through audio. In this paper, an approach to detect the sentiment of online spoken reviews based on relative prosody features is presented. Most existing systems for audio-based sentiment analysis use conventional audio features, but these are not problem-specific features for extracting sentiment. In this work, relative prosody features are extracted from the normal and stressed regions of the audio signal to detect sentiment. Stressed regions are identified using the strength of excitation. Support Vector Machine (SVM) and Gaussian Mixture Model (GMM) classifiers are used to build the sentiment models. The MOUD database is used for the proposed study. Experimental results show that the rate of detecting sentiment improves with relative prosody features compared with prosody and Mel Frequency Cepstral Coefficient (MFCC) features, because relative prosody features have more sentiment-specific discrimination than prosody features.
Exploiting Morphological Regularities in Distributional Word Representations
AVIJIT VAJPAYEE,ARIHANT GUPTA,SYED SARFARAZ AKHTAR,ARJIT SRIVASTAVA,MADAN GOPAL JHANWAR,Manish Srivastava
Conference on Empirical Methods in Natural Language Processing, EMNLP, 2017
@inproceedings{bib_Expl_2017, AUTHOR = {AVIJIT VAJPAYEE, ARIHANT GUPTA, SYED SARFARAZ AKHTAR, ARJIT SRIVASTAVA, MADAN GOPAL JHANWAR, Manish Srivastava}, TITLE = {Exploiting Morphological Regularities in Distributional Word Representations}, BOOKTITLE = {Conference on Empirical Methods in Natural Language Processing}, YEAR = {2017}}
We present a simple, fast and unsupervised approach for exploiting morphological regularities present in high-dimensional vector spaces. We propose a novel method for generating embeddings of words from their morphological variants using morphological transformation operators. We evaluate this approach on the MSR word analogy test set (Mikolov et al., 2013d) with an accuracy of 85%, which is 12% higher than the previous best known system.
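One common reading of a morphological transformation operator, which the abstract above alludes to, is an offset in embedding space estimated from known base/variant pairs and applied to unseen words. This is an illustrative sketch only, not the paper's actual method; the 4-dimensional vectors and word list are made up for the demo.

```python
import numpy as np

# Toy embedding table (hypothetical 4-d vectors; a real model would be trained).
emb = {
    "walk":   np.array([0.9, 0.1, 0.0, 0.2]),
    "walked": np.array([0.9, 0.1, 0.8, 0.2]),
    "play":   np.array([0.3, 0.7, 0.0, 0.1]),
    "played": np.array([0.3, 0.7, 0.8, 0.1]),
    "jump":   np.array([0.5, 0.4, 0.0, 0.6]),
}

def transformation_operator(pairs):
    """Estimate an operator as the mean offset from base forms to variants."""
    return np.mean([emb[variant] - emb[base] for base, variant in pairs], axis=0)

# Learn a "past tense" operator from observed morphological pairs.
past = transformation_operator([("walk", "walked"), ("play", "played")])

# Generate an embedding for the unseen variant "jumped" from its base form.
emb_jumped = emb["jump"] + past
```

Applying the same learned offset to any base form yields an embedding for its morphological variant, which is the sense in which such operators let small vocabularies cover unseen inflections.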
WebShodh: A Code Mixed Factoid Question Answering System for Web
Khyathi Raghavi Chandu,Manoj Chinnakotla,Alan W Black,Manish Srivastava
International Conference of the CLEF Association, CLEFS, 2017
@inproceedings{bib_WebS_2017, AUTHOR = {Khyathi Raghavi Chandu, Manoj Chinnakotla, Alan W Black, Manish Srivastava}, TITLE = {WebShodh: A Code Mixed Factoid Question Answering System for Web}, BOOKTITLE = {International Conference of the CLEF Association}, YEAR = {2017}}
Code-Mixing (CM) is a natural phenomenon observed in many multilingual societies and is becoming the preferred medium of expression and communication in online and social media fora. In spite of this, current Question Answering (QA) systems do not support CM and are only designed to work with a single interaction language. This assumption makes it inconvenient for multilingual users to interact naturally with the QA system, especially in scenarios where they do not know the right word in the target language. In this paper, we present WebShodh - an end-to-end web-based factoid QA system for CM languages. We demonstrate our system with two CM language pairs: Hinglish (matrix language: Hindi, embedded language: English) and Tenglish (matrix language: Telugu, embedded language: English). The lack of language resources such as annotated corpora, POS taggers or parsers for CM languages poses a huge challenge for automated processing and analysis. In view of this resource scarcity, we only assume the existence of bilingual dictionaries from the matrix languages to English and use them for lexically translating the question into English. Later, we use this loosely translated question for downstream analysis such as Answer Type (AType) prediction, answer retrieval and ranking. Evaluation of our system reveals that we achieve an MRR of 0.37 and 0.32 for Hinglish and Tenglish respectively. We hosted this system online and plan to leverage it for collecting more CM question and answer data for further improvement.
Injecting Word Embeddings with Another Language’s Resource: An Application of Bilingual Embeddings
PRAKHAR PANDEY,Vikram Pudi,Manish Srivastava
International Joint Conference on Natural Language Processing, IJCNLP, 2017
@inproceedings{bib_Inje_2017, AUTHOR = {PRAKHAR PANDEY, Vikram Pudi, Manish Srivastava}, TITLE = {Injecting Word Embeddings with Another Language’s Resource: An Application of Bilingual Embeddings}, BOOKTITLE = {International Joint Conference on Natural Language Processing}, YEAR = {2017}}
Word embeddings learned from a text corpus can be improved by injecting knowledge from external resources, while at the same time also specializing them for similarity or relatedness. These knowledge resources (like WordNet or the Paraphrase Database) may not exist for all languages. In this work we introduce a method to inject the word embeddings of a language with the knowledge resources of another language by leveraging bilingual embeddings. First we improve the word embeddings of German, Italian, French and Spanish using resources of English and test them on a variety of word similarity tasks. Then we demonstrate the utility of our method by creating improved embeddings for the Urdu and Telugu languages using the Hindi WordNet, beating the previously established baseline for Urdu.
Deep Neural Network based system for solving Arithmetic Word problems
Purvanshi Mehta,PRUTHWIK MISHRA,ATHAVALE VINAYAK SANJAY,Manish Srivastava,Dipti Mishra Sharma
International Joint Conference on Natural Language Processing, IJCNLP, 2017
@inproceedings{bib_Deep_2017, AUTHOR = {Purvanshi Mehta, PRUTHWIK MISHRA, ATHAVALE VINAYAK SANJAY, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Deep Neural Network based system for solving Arithmetic Word problems}, BOOKTITLE = {International Joint Conference on Natural Language Processing}, YEAR = {2017}}
This paper presents DILTON, a system which solves simple arithmetic word problems. DILTON first predicts the operation to be performed (’-’, ’+’, ’*’, ’/’) through a deep neural network based model and then uses it to generate the answer. DILTON divides the question into two parts - world state and query - as shown in Figure 1. The world state and the query are processed separately in two different networks, and finally the networks are merged to predict the final operation. DILTON learns to predict operations with 8.81% accuracy on a corpus of primary school questions. With simple similarity between the contexts of quantities appearing in the problem and the question text, we are able to identify 92.25% of relevant quantities and solve 81% of the questions. Our code and data are publicly available.
An Unsupervised Approach for Mapping between Vector Spaces
SYED SARFARAZ AKHTAR,ARIHANT GUPTA,AVIJIT VAJPAYEE,ARJIT SRIVASTAVA,MADAN GOPAL JHANWAR,Manish Srivastava
Technical Report, arXiv, 2017
@inproceedings{bib_An_U_2017, AUTHOR = {SYED SARFARAZ AKHTAR, ARIHANT GUPTA, AVIJIT VAJPAYEE, ARJIT SRIVASTAVA, MADAN GOPAL JHANWAR, Manish Srivastava}, TITLE = {An Unsupervised Approach for Mapping between Vector Spaces}, BOOKTITLE = {Technical Report}, YEAR = {2017}}
We present a language-independent, unsupervised approach for transforming word embeddings from a source language to a target language using a transformation matrix. Our model handles the problem of data scarcity faced by many languages in the world and yields improved word embeddings for words in the target language by relying on transformed embeddings of words of the source language. We initially evaluate our approach via word similarity tasks on a similar language pair - Hindi as the source and Urdu as the target language - while we also evaluate our method on French and German as target languages and English as the source language. Our approach improves on the current state-of-the-art results by 13% for French and 19% for German. For Urdu, we saw an improvement of 16% over our initial baseline score. We further explore the prospects of our approach by applying it on multiple models of the same language and transferring words between the two models, thus solving the problem of missing words in a model. We evaluate this on word similarity and word analogy tasks.
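The core of any transformation-matrix approach like the one above can be sketched with least squares: given embeddings of corresponding words in two spaces, solve for the linear map that carries one into the other. Note the paper's method is unsupervised; this sketch assumes a known correspondence set purely for illustration, with synthetic vectors standing in for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: n word pairs whose embeddings correspond across two spaces.
n, d = 50, 8
X = rng.normal(size=(n, d))        # source-space embeddings
W_true = rng.normal(size=(d, d))   # hidden ground-truth map (demo only)
Y = X @ W_true                     # target-space embeddings

# Learn the transformation matrix by least squares: min_W ||X W - Y||^2.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Map a new source-space word into the target space.
x_new = rng.normal(size=(1, d))
y_pred = x_new @ W
```

Once `W` is estimated, every source-space vector can be projected into the target space, which is how transformed embeddings can stand in for missing words there.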
Unsupervised Morphological Expansion of Small Datasets for Improving Word Embeddings
SYED SARFARAZ AKHTAR,ARIHANT GUPTA,AVIJIT VAJPAYEE,ARJIT SRIVASTAVA,Manish Srivastava
Technical Report, arXiv, 2017
@inproceedings{bib_Unsu_2017, AUTHOR = {SYED SARFARAZ AKHTAR, ARIHANT GUPTA, AVIJIT VAJPAYEE, ARJIT SRIVASTAVA, Manish Srivastava}, TITLE = {Unsupervised Morphological Expansion of Small Datasets for Improving Word Embeddings}, BOOKTITLE = {Technical Report}, YEAR = {2017}}
We present a language-independent, unsupervised method for building word embeddings using morphological expansion of text. Our model handles the problem of data sparsity and yields improved word embeddings by relying on training word embeddings on artificially generated sentences. We evaluate our method using small training sets on eleven test sets for the word similarity task across seven languages. Further, for English, we evaluate the impact of our approach using a large training set on three standard test sets. Our method improved results across all languages.
End to End Dialog System for Telugu
DANDA PRATHYUSHA,PRATHYUSHA JWALAPURAM,Manish Srivastava
International Conference on Natural Language Processing, ICON, 2017
@inproceedings{bib_End__2017, AUTHOR = {DANDA PRATHYUSHA, PRATHYUSHA JWALAPURAM, Manish Srivastava}, TITLE = {End to End Dialog System for Telugu}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2017}}
This paper describes an end-to-end dialog system created using sequence-to-sequence learning and memory networks for Telugu, a low-resource language. We automatically generate dialog data for Telugu in the tourist domain, using a knowledge base that provides tourist place, type, tour time, etc. Using this data, we train a sequence-to-sequence model to learn system responses in the dialog. In order to add query prediction for information retrieval (through API calls), we train a memory network. We also handle cases requiring the updating of API calls and querying for additional information. Using the combination of sequence-to-sequence learning and a memory network, we successfully create an end-to-end dialog system for Telugu.
Beyond Word2Vec: Embedding Words and Phrases in Same Vector Space
Vijay Prakash Dwivedi,Manish Srivastava
International Conference on Natural Language Processing, ICON, 2017
@inproceedings{bib_Beyo_2017, AUTHOR = {Vijay Prakash Dwivedi, Manish Srivastava}, TITLE = {Beyond Word2Vec: Embedding Words and Phrases in Same Vector Space}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2017}}
Word embeddings are being used for several linguistic problems and NLP tasks. Solutions to such problems have improved greatly because of recent breakthroughs in the vector representation of words and research in vector space models. However, embedding phrases as vectors while keeping their semantics consistent with words has been challenging. We propose a novel methodology using Siamese deep neural networks to embed multi-word units and fine-tune the current state-of-the-art word embeddings, keeping both in the same vector space. We show several semantic relations between words and phrases using the embeddings generated by our system and show that the similarity of words and their corresponding paraphrases is maximized using the modified embeddings.
DNN-HMM Acoustic Modeling for Large Vocabulary Telugu Speech Recognition
VISHNU VIDYADHARA RAJU V,GURUGUBELLI KRISHNA,VYDANA HARI KRISHNA,Bhargav Pulugundla,Manish Srivastava,Anil Kumar Vuppala
International Conference on Mining Intelligence and Knowledge Exploration, MIKE, 2017
@inproceedings{bib_DNN-_2017, AUTHOR = {VISHNU VIDYADHARA RAJU V, GURUGUBELLI KRISHNA, VYDANA HARI KRISHNA, Bhargav Pulugundla, Manish Srivastava, Anil Kumar Vuppala}, TITLE = {DNN-HMM Acoustic Modeling for Large Vocabulary Telugu Speech Recognition}, BOOKTITLE = {International Conference on Mining Intelligence and Knowledge Exploration}, YEAR = {2017}}
The main focus of this paper is the development of a large vocabulary Telugu speech database. Telugu is a low-resource language for which no standardized database exists for building automatic speech recognition (ASR) systems. The database consists of neutral speech samples collected from 100 speakers for building the Telugu ASR system, and it is named the IIIT-H Telugu speech corpus. The design of the speech and text corpus and the procedure followed for the collection of the database are discussed in detail. Preliminary results for the ASR models built on this database are reported. The architectural choices of deep neural networks (DNNs) play a crucial role in improving the performance of ASR systems. ASR systems trained with hybrid DNNs (DNN-HMM) with more hidden layers have shown better performance than conventional GMMs (GMM-HMM). The Kaldi toolkit is used for building the acoustic models required for the ASR system.
Significance of DNN-AM for Multimodal Sentiment Analysis
HARIKA ABBURI,Rajendra Prasath,Manish Srivastava,Suryakanth Gangashetty
International Conference on Mining Intelligence and Knowledge Exploration, MIKE, 2017
@inproceedings{bib_Sign_2017, AUTHOR = {HARIKA ABBURI, Rajendra Prasath, Manish Srivastava, Suryakanth Gangashetty}, TITLE = {Significance of DNN-AM for Multimodal Sentiment Analysis}, BOOKTITLE = {International Conference on Mining Intelligence and Knowledge Exploration}, YEAR = {2017}}
The growth of social media has led people to share reviews in various forms such as video, audio and text. Recently, sentiment classification has achieved success using neural networks. In this paper, a neural network approach is presented to detect sentiment from audio and text models. For audio, features like Mel Frequency Cepstral Coefficients (MFCC) are used to build Deep Neural Network (DNN) and Deep Neural Network Attention Mechanism (DNNAM) classifiers. From the results, it is noticed that the DNNAM gives better results than the DNN, because the DNN is frame-based whereas the DNNAM performs utterance-level classification and thereby uses the context efficiently. Additionally, textual features are extracted from the transcript of the audio input using a Word2vec model. Support Vector Machine (SVM) and Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) classifiers are used to develop a sentiment model. From the experiments it is noticed that the LSTM-RNN outperforms the SVM, as the LSTM-RNN is able to memorize long temporal context. The performance is also significantly improved by combining the audio and text modalities.
Multimodal sentiment analysis using deep neural networks
Harika Abburi,Rajendra Prasath,Manish Srivastava,Suryakanth Gangashetty
International Conference on Mining Intelligence and Knowledge Exploration, MIKE, 2016
@inproceedings{bib_Mult_2016, AUTHOR = {Harika Abburi, Rajendra Prasath, Manish Srivastava, Suryakanth Gangashetty}, TITLE = {Multimodal sentiment analysis using deep neural networks}, BOOKTITLE = {International Conference on Mining Intelligence and Knowledge Exploration}, YEAR = {2016}}
Due to the increase in online product reviews posted daily through various modalities such as video, audio and text, sentiment analysis has gained huge attention. Recent developments in web technologies have also enabled the increase of web content in Hindi. In this paper, an approach to detect the sentiment of online Hindi product reviews based on their multimodal nature (audio and text) is presented. For each audio input, Mel Frequency Cepstral Coefficient (MFCC) features are extracted. These features are used to develop sentiment models using Gaussian Mixture Model (GMM) and Deep Neural Network (DNN) classifiers. From the results, it is observed that the DNN classifier gives better results compared to the GMM. Further, textual features are extracted from the transcript of the audio input using Doc2vec vectors. A Support Vector Machine (SVM) classifier is used to develop a sentiment model using these textual features. From the experimental results it is observed that combining the audio and text features improves the performance of detecting the sentiment of online product reviews.
Code Mixed Entity Extraction in Indian Languages using Neural Networks
IRSHAD AHMAD BHAT,Manish Srivastava,RIYAZ AHMAD BHAT
Forum for Information Retrieval Evaluation, FIRE, 2016
@inproceedings{bib_Code_2016, AUTHOR = {IRSHAD AHMAD BHAT, Manish Srivastava, RIYAZ AHMAD BHAT}, TITLE = {Code Mixed Entity Extraction in Indian Languages using Neural Networks}, BOOKTITLE = {Forum for Information Retrieval Evaluation}, YEAR = {2016}}
In this paper we present our submission for the FIRE 2016 Shared Task on Code Mixed Entity Extraction in Indian Languages. We describe a neural network system for entity extraction in Hindi-English code-mixed text. Our method uses distributed word representations as features for the neural network and can therefore easily be replicated across languages. Our system ranked first for Hindi-English with an F1-score of 68.24%.
Mirror on the wall: Finding similar questions with deep structured topic modeling
ARPITA DAS,Manish Srivastava,Manoj Chinnakotla
Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD, 2016
@inproceedings{bib_Mirr_2016, AUTHOR = {ARPITA DAS, Manish Srivastava, Manoj Chinnakotla}, TITLE = {Mirror on the wall: Finding similar questions with deep structured topic modeling}, BOOKTITLE = {Pacific-Asia Conference on Knowledge Discovery and Data Mining}, YEAR = {2016}}
Internet users today prefer getting precise answers to their questions rather than sifting through a bunch of relevant documents provided by search engines. This has led to the huge popularity of Community Question Answering (cQA) services like Yahoo! Answers, Baidu Zhidao, Quora, Stack Overflow, etc., where forum users respond to questions with precise answers. Over time, such cQA archives become rich repositories of knowledge encoded in the form of questions and user-generated answers. In cQA archives, retrieval of similar questions, which have already been answered in some form, is important for improving the effectiveness of such forums. The main challenge while retrieving similar questions is the “lexico-syntactic” gap between the user query and the questions already present in the forum. In this paper, we propose a novel approach called “Deep Structured Topic Model (DSTM)” to bridge the lexico-syntactic gap between the question posed by the user and forum questions. DSTM employs a two-step process: initially retrieving similar questions that lie in the vicinity of the query in the latent topic vector space, and then re-ranking them using a deep layered semantic model. Experiments on a large-scale real-life cQA dataset show that our approach outperforms state-of-the-art translation and topic based baseline approaches.
Transition-Based Syntactic Linearization with Lookahead Features
PUDUPPULLY RATISH SURENDRAN,Yue Zhang,Manish Srivastava
Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 2016
@inproceedings{bib_Tran_2016, AUTHOR = {PUDUPPULLY RATISH SURENDRAN, Yue Zhang, Manish Srivastava}, TITLE = {Transition-Based Syntactic Linearization with Lookahead Features}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics}, YEAR = {2016}}
It has been shown that transition-based methods can be used for syntactic word ordering and tree linearization, achieving significantly faster speed compared with traditional best-first methods. State-of-the-art transition-based models give competitive results on abstract word ordering and unlabeled tree linearization, but significantly worse results on labeled tree linearization. We demonstrate that the main cause of the performance bottleneck is the sparsity of SHIFT transition actions rather than heavy pruning. To address this issue, we propose a modification to the standard transition-based feature structure, which reduces feature sparsity and allows lookahead features at a small cost to decoding efficiency. Our model gives the best reported accuracies on all benchmarks, while still being over 30 times faster compared with best-first search.
Kathaa: A visual programming framework for nlp applications
SHARADA PRASANNA MOHANTY,NEHAL JAGDISH WANI,Manish Srivastava,Dipti Mishra Sharma
Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 2016
@inproceedings{bib_Kath_2016, AUTHOR = {SHARADA PRASANNA MOHANTY, NEHAL JAGDISH WANI, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Kathaa: A visual programming framework for nlp applications}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics}, YEAR = {2016}}
In this paper, we present Kathaa, an open source web-based Visual Programming Framework for NLP applications. It supports the design, execution and analysis of complex NLP systems by choosing and visually connecting NLP modules from an already available and easily extensible module library. It models NLP systems as a Directed Acyclic Graph of optionally parallelized information flow, and lets the user choose and use available modules in their NLP applications irrespective of their technical proficiency. Kathaa exposes a precise module definition API to allow easy integration of external NLP components (along with their associated services as Docker containers), and it allows everyone to publish their services in a standardized format for everyone else to use out of the box.
Deep feature fusion network for answer quality prediction in community question answering
SUGGU SAI PRANEETH,TATAKUNTLA KUSHWANTH NAGA GOUTHAM,Manoj K. Chinnakotla,Manish Srivastava
Workshop on Neural Information Retrieval, NIR-W, 2016
@inproceedings{bib_Deep_2016, AUTHOR = {SUGGU SAI PRANEETH, TATAKUNTLA KUSHWANTH NAGA GOUTHAM, Manoj K. Chinnakotla, Manish Srivastava}, TITLE = {Deep feature fusion network for answer quality prediction in community question answering}, BOOKTITLE = {Workshop on Neural Information Retrieval}, YEAR = {2016}}
Community Question Answering (cQA) forums have become a popular medium for soliciting direct answers to specific questions of users from experts or other experienced users on a given topic. However, for a given question, users sometimes have to sift through a large number of low-quality or irrelevant answers to find the answer which satisfies their information need. To alleviate this, the problem of Answer Quality Prediction (AQP) aims to predict the quality of an answer posted in response to a forum question. Current AQP systems either learn models using a) various hand-crafted features (HCF) or b) deep learning (DL) techniques which automatically learn the required feature representations. In this paper, we propose a novel approach for AQP known as the “Deep Feature Fusion Network (DFFN)”, which leverages the advantages of both hand-crafted features and deep learning based systems. Given a question-answer pair along with its metadata, DFFN independently a) learns deep features using a Convolutional Neural Network (CNN) and b) computes hand-crafted features using various external resources, and then combines them using a deep neural network trained to predict the final answer quality. DFFN achieves state-of-the-art performance on the standard SemEval-2015 and SemEval-2016 benchmark datasets and outperforms baseline approaches which individually employ either HCF or DL based techniques alone.
Together we stand: Siamese networks for similar question retrieval
ARPITA DAS,HARISH YENALA,Manoj Chinnakotla,Manish Srivastava
Conference of the Association of Computational Linguistics, ACL, 2016
@inproceedings{bib_Toge_2016, AUTHOR = {ARPITA DAS, HARISH YENALA, Manoj Chinnakotla, Manish Srivastava}, TITLE = {Together we stand: Siamese networks for similar question retrieval}, BOOKTITLE = {Conference of the Association of Computational Linguistics}, YEAR = {2016}}
Community Question Answering (cQA) services like Yahoo! Answers, Baidu Zhidao, Quora, Stack Overflow, etc. provide a platform for interaction with experts and help users to obtain precise and accurate answers to their questions. The time lag between a user posting a question and receiving its answer could be reduced by retrieving similar historic questions from the cQA archives. The main challenge in this task is the “lexico-syntactic” gap between the current and the previous questions. In this paper, we propose a novel approach called “Siamese Convolutional Neural Network for cQA (SCQA)” to find the semantic similarity between the current and the archived questions. SCQA consists of twin convolutional neural networks with shared parameters and a contrastive loss function joining them. SCQA learns the similarity metric for question-question pairs by leveraging the question-answer pairs available in cQA forum archives. The model projects semantically similar question pairs nearer to each other and dissimilar question pairs farther away from each other in the semantic space. Experiments on the large-scale real-life “Yahoo! Answers” dataset reveal that SCQA outperforms current state-of-the-art approaches based on translation models, topic models and deep neural network based models which use non-shared parameters.
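The contrastive loss mentioned in the abstract above is a standard formulation that pulls similar pairs together and pushes dissimilar pairs apart until they are at least a margin away. This sketch computes only that loss on precomputed toy encodings; it does not reproduce SCQA's twin convolutional networks, and the vectors and margin are made-up values for illustration.

```python
import numpy as np

def contrastive_loss(h1, h2, similar, margin=1.0):
    """Contrastive loss on a pair of encodings: similar pairs are penalized
    by their squared distance; dissimilar pairs are penalized only while
    they sit closer than `margin`."""
    d = np.linalg.norm(h1 - h2)
    if similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])
c = np.array([0.5, 0.0])

loss_sim = contrastive_loss(a, b, similar=True)    # identical similar pair
loss_dis = contrastive_loss(a, c, similar=False)   # dissimilar pair within the margin
```

Minimizing this objective over many pairs is what drives a shared-weight encoder to place semantically similar questions near each other in the learned space.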
Articulatory Gesture Rich Representation Learning of Phonological Units in Low Resource Settings
BRIJ MOHAN LAL SRIVASTAVA,Manish Srivastava
International Conference on Statistical Language and Speech Processing, SLSP, 2016
@inproceedings{bib_Arti_2016, AUTHOR = {BRIJ MOHAN LAL SRIVASTAVA, Manish Srivastava}, TITLE = {Articulatory Gesture Rich Representation Learning of Phonological Units in Low Resource Settings}, BOOKTITLE = {International Conference on Statistical Language and Speech Processing}, YEAR = {2016}}
Recent literature presents evidence that both linguistic (phonemic) and non-linguistic (speaker identity, emotional content) information resides on a lower-dimensional manifold embedded richly inside higher-dimensional spectral features like MFCC and PLP. Linguistic or phonetic units of speech can be broken down into a legal inventory of articulatory gestures shared across several phonemes based on their manner of articulation. We intend to discover a subspace that is rich in gestural information of speech and captures the invariance of similar gestures. In this paper, we investigate the unsupervised techniques best suited for learning such a subspace. The main contribution of the paper is an approach to learn a gesture-rich representation of speech automatically from data in a completely unsupervised manner. This study compares the representations obtained through a convolutional autoencoder (ConvAE) and standard unsupervised dimensionality reduction techniques such as manifold learning and Principal Component Analysis (PCA) through the task of phoneme classification. Manifold learning techniques such as Locally Linear Embedding (LLE), Isomap and Laplacian Eigenmaps are evaluated in this study. The representations which best separate different gestures are suitable for discovering subword units in low or zero resource speech conditions. Further, we evaluate the representations using the Zero Resource Speech Challenge’s ABX discriminability measure. Results indicate that the representations obtained through ConvAE and Isomap outperform baseline MFCC features in the task of phoneme classification as well as on the ABX measure, and induce separation between sounds composed of different sets of gestures. We further cluster the representations using a Dirichlet Process Gaussian Mixture Model (DPGMM) to automatically learn the cluster distribution of the data and show that these clusters correspond to groups of similar manner of articulation. The DPGMM distribution is used as a prior to obtain correspondence terms for robust ConvAE training.
Towards deep learning in Hindi NER: An approach to tackle the labelled data scarcity
ATHAVALE VINAYAK SANJAY,Shreenivas Bharadwaj,Monik Pamecha,PRABHU AMEYA PANDURANG,Manish Srivastava
International Conference on Natural Language Processing, ICON, 2016
@inproceedings{bib_Towa_2016, AUTHOR = {ATHAVALE VINAYAK SANJAY, Shreenivas Bharadwaj, Monik Pamecha, PRABHU AMEYA PANDURANG, Manish Srivastava}, TITLE = {Towards deep learning in Hindi NER: An approach to tackle the labelled data scarcity}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2016}}
In this paper we describe an end-to-end neural model for Named Entity Recognition (NER) which is based on a bi-directional RNN-LSTM. Almost all NER systems for Hindi use language-specific features and handcrafted rules with gazetteers. Our model is language independent and uses no domain-specific features or any handcrafted rules. Our models rely on semantic information in the form of word vectors which are learnt by an unsupervised learning algorithm on an unannotated corpus. Our model attained state-of-the-art performance in both English and Hindi without the use of any morphological analysis or gazetteers of any sort.
Improved multimodal sentiment detection using stressed regions of audio
HARIKA ABBURI,Manish Srivastava,Suryakanth Gangashetty
IEEE Region 10 Conference, TENCON, 2016
@inproceedings{bib_Impr_2016, AUTHOR = {HARIKA ABBURI, Manish Srivastava, Suryakanth Gangashetty}, TITLE = {Improved multimodal sentiment detection using stressed regions of audio}, BOOKTITLE = {IEEE Region 10 Conference}, YEAR = {2016}}
Recent advancement of social media has led people to share product reviews through various modalities such as audio, text and video. In this paper, an improved approach to detect the sentiment of online spoken reviews based on their multimodal nature (audio and text) is presented. To extract sentiment from audio, Mel Frequency Cepstral Coefficient (MFCC) features are extracted at stressed regions, which are detected based on the strength of excitation. A Gaussian Mixture Model (GMM) classifier is employed to develop a sentiment model using these features. From the results, it is observed that MFCC features extracted at the stressed regions perform better than features extracted from the whole audio input. Further, from the transcript of the audio input, textual features are computed using Doc2vec vectors. A Support Vector Machine (SVM) classifier is used to develop a sentiment model using these textual features. From the experimental results it is observed that combining the audio and text features improves the performance of detecting the sentiment of a review.
Hand in Glove: Deep Feature Fusion Network Architectures for Answer Quality Prediction in Community Question Answering
SUGGU SAI PRANEETH,TATAKUNTLA KUSHWANTH NAGA GOUTHAM,Manoj K. Chinnakotla,Manish Srivastava
International Conference on Computational Linguistics, COLING, 2016
@inproceedings{bib_Hand_2016, AUTHOR = {SUGGU SAI PRANEETH, TATAKUNTLA KUSHWANTH NAGA GOUTHAM, Manoj K. Chinnakotla, Manish Srivastava}, TITLE = {Hand in Glove: Deep Feature Fusion Network Architectures for Answer Quality Prediction in Community Question Answering}, BOOKTITLE = {International Conference on Computational Linguistics}, YEAR = {2016}}
Community Question Answering (cQA) forums have become a popular medium for soliciting answers to specific user questions from experts and experienced users on a given topic. However, for a given question, users sometimes have to sift through a large number of low-quality or irrelevant answers to find the answer that satisfies their information need. To alleviate this, the problem of Answer Quality Prediction (AQP) aims to predict the quality of an answer posted in response to a forum question. Current AQP systems learn models using either a) various hand-crafted features (HCF) or b) Deep Learning (DL) techniques, which automatically learn the feature representations. In this paper, we propose a novel approach for AQP known as the "Deep Feature Fusion Network (DFFN)", which combines the advantages of both hand-crafted features and deep learning based systems. Given a question-answer pair along with its metadata, a DFFN architecture independently a) learns features using a Deep Neural Network (DNN) and b) computes hand-crafted features leveraging various external resources, and then combines them using a fully connected neural network trained to predict the quality of the given answer. DFFN is an end-to-end differentiable model trained as a single system. We propose two DFFN architectures which vary mainly in the way they model the input question/answer pair: a) DFFN-CNN, which uses a Convolutional Neural Network (CNN), and b) DFFN-BLNA, which uses a Bi-directional LSTM with Neural Attention (BLNA). Both proposed variants of DFFN (DFFN-CNN and DFFN-BLNA) achieve state-of-the-art performance on the standard SemEval-2015 and SemEval-2016 benchmark datasets and outperform baseline approaches which individually employ either HCF or DL based techniques alone.
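The fusion step described in the abstract — concatenating learned and hand-crafted feature vectors and scoring them with a fully connected network — can be sketched as follows. This is a toy forward pass with assumed dimensions (`fuse_and_score`, the 8-d/4-d sizes and random weights are illustrative), not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_score(learned, handcrafted, w1, b1, w2, b2):
    """Concatenate automatically learned and hand-crafted feature vectors
    and pass them through a small fully connected scorer, mirroring the
    fusion idea of DFFN. (Toy dimensions; illustrative only.)"""
    x = np.concatenate([learned, handcrafted])
    h = np.maximum(0.0, w1 @ x + b1)                 # ReLU hidden layer
    return float(1 / (1 + np.exp(-(w2 @ h + b2))))  # quality score in (0, 1)

# Assumed sizes: 8-d learned vector (e.g. a CNN/BLNA output) and
# 4-d hand-crafted features, fused through a 6-unit hidden layer.
learned = rng.normal(size=8)
handcrafted = rng.normal(size=4)
w1, b1 = rng.normal(size=(6, 12)), np.zeros(6)
w2, b2 = rng.normal(size=6), 0.0
score = fuse_and_score(learned, handcrafted, w1, b1, w2, b2)
```

Because the whole composition is differentiable, both the learned-feature branch and the fusion layer can be trained jointly as a single system, which is the property the abstract emphasizes.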
Vaidya: A Spoken Dialog System for Health Domain
BRIJ MOHAN LAL SRIVASTAVA,DANDA PRATHYUSHA,Manish Srivastava
International Conference on Natural Language Processing., ICON, 2016
@inproceedings{bib_Vaid_2016, AUTHOR = {BRIJ MOHAN LAL SRIVASTAVA, DANDA PRATHYUSHA, Manish Srivastava}, TITLE = {Vaidya: A Spoken Dialog System for Health Domain}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2016}}
In this paper, we introduce Vaidya, a spoken dialog system developed as part of the ITRA project. The system is capable of providing an approximate diagnosis by accepting symptoms as free-form speech in real time on both laptop and hand-held devices. The system focuses on challenges in speech recognition specific to Indian languages and on capturing the intent of the user. Another challenge is to create models which are memory- and CPU-efficient for hand-held devices. We describe our progress, experiences and approaches in building a system that can handle English as the input speech. The system is evaluated using a subjective statistical measure (Fleiss' kappa) to assess its usability.
Kathaa: NLP Systems as Edge-Labeled Directed Acyclic MultiGraphs
SHARADA PRASANNA MOHANTY,NEHAL JAGDISH WANI,Manish Srivastava,Dipti Mishra Sharma
International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infr, OIAF4HLT | WS, 2016
@inproceedings{bib_Kath_2016, AUTHOR = {SHARADA PRASANNA MOHANTY, NEHAL JAGDISH WANI, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Kathaa: NLP Systems as Edge-Labeled Directed Acyclic MultiGraphs}, BOOKTITLE = {International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infr}. YEAR = {2016}}
We present Kathaa, an open-source web-based visual programming framework for Natural Language Processing (NLP) systems. Kathaa supports the design, execution and analysis of complex NLP systems by visually connecting NLP components from an easily extensible module library. It models NLP systems as edge-labeled Directed Acyclic MultiGraphs, and lets users incorporate publicly co-created modules in their own NLP applications irrespective of their technical proficiency in Natural Language Processing. Kathaa exposes an intuitive web-based interface for users to interact with and modify complex NLP systems, and a precise module-definition API to allow easy integration of new state-of-the-art NLP components. Kathaa enables researchers to publish their services in a standardized format so that anyone can use them out of the box. The vision of this work is to pave the way for a system like Kathaa to become the Lego blocks of NLP research and applications. As a practical use case, we use Kathaa to visually implement the Sampark Hindi-Panjabi and Sampark Hindi-Urdu Machine Translation pipelines, demonstrating that Kathaa can handle complex NLP systems while remaining intuitive for the end user.
Starting Small Learning Strategies for Speech Recognition
VYDANA HARI KRISHNA,BRIJ MOHAN LAL SRIVASTAVA,Manish Srivastava,Anil Kumar Vuppala
India Council International Conference, INDICON, 2016
@inproceedings{bib_Star_2016, AUTHOR = {VYDANA HARI KRISHNA, BRIJ MOHAN LAL SRIVASTAVA, Manish Srivastava, Anil Kumar Vuppala}, TITLE = {Starting Small Learning Strategies for Speech Recognition}, BOOKTITLE = {India Council International Conference}. YEAR = {2016}}
Designing learning strategies has gained considerable scientific interest during the recent progress of deep learning methodologies. Curriculum learning is a strategy that trains a neural network model by presenting samples in a specific, meaningful order rather than randomly sampling training examples from the data distribution. In this work, we explore the starting-small paradigm of curriculum learning for speech recognition. The starting-small paradigm is realized as a two-step learning strategy: the training dataset is re-organized as a set of easily classifiable examples followed by the actual training dataset, and the model is trained on the re-organized dataset. We hypothesize that following the starting-small paradigm initializes the learning better and lets it progress to a better convergence. We propose to rank the toughness of a training example by the posterior probabilities obtained from a previously trained model. Apart from re-arranging the training corpus, the starting-small paradigm is also applied at the model level: we consider the broad manner-class classification objective function as a smoother version of the phone-class classification objective function, and a model initially trained for broad-class classification is later adapted for phone classification. In this work, we use TIMIT and a subset of the Wall Street Journal (WSJ) corpus to validate the experiments; both learning strategies show consistently better performance across the two datasets compared to a baseline system trained by randomly sampling the dataset.
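The corpus re-ordering step above — ranking example "toughness" by the posterior a previously trained model assigns to the true class — can be sketched in a few lines. This is a minimal illustration of the ordering idea only (the function name and toy inputs are assumptions), not the paper's training pipeline.

```python
def curriculum_order(examples, true_class_posteriors):
    """Order training examples easiest-first for a starting-small
    curriculum: an example whose true class gets a high posterior from a
    previously trained model is considered easy; low posterior = tough."""
    ranked = sorted(zip(examples, true_class_posteriors),
                    key=lambda pair: pair[1], reverse=True)
    return [example for example, _ in ranked]

# Easy, confidently classified utterances come first; hard ones last.
ordered = curriculum_order(["utt1", "utt2", "utt3"], [0.2, 0.9, 0.5])
```

Training would then proceed over `ordered` (easy subset first, then the full set), which is the two-step strategy the abstract describes.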
Shallow parsing pipeline for Hindi-English code-mixed social media text
ARNAV SHARMA,SAKSHI GUPTA,RAVEESH MOTLANI,PIYUSH BANSAL,Manish Srivastava,Radhika Mamidi,Dipti Mishra Sharma
Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 2016
@inproceedings{bib_Shal_2016, AUTHOR = {ARNAV SHARMA, SAKSHI GUPTA, RAVEESH MOTLANI, PIYUSH BANSAL, Manish Srivastava, Radhika Mamidi, Dipti Mishra Sharma}, TITLE = {Shallow parsing pipeline for hindi-english code-mixed social media text}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics}. YEAR = {2016}}
In this study, the problem of shallow parsing of Hindi-English code-mixed social media text (CSMT) is addressed. We have annotated the data and developed a language identifier, a normalizer, a part-of-speech tagger and a shallow parser. To the best of our knowledge, we are the first to attempt shallow parsing on CSMT. The pipeline has been made available to the research community with the goal of enabling better text analysis of Hindi-English CSMT. The pipeline is accessible at this http URL.
Towards sub-word level compositions for sentiment analysis of Hindi-English code-mixed text
ADITYA JOSHI,PRABHU AMEYA PANDURANG,Manish Srivastava,Vasudeva Varma Kalidindi
International Conference on Computational Linguistics, COLING, 2016
@inproceedings{bib_Towa_2016, AUTHOR = {ADITYA JOSHI, PRABHU AMEYA PANDURANG, Manish Srivastava, Vasudeva Varma Kalidindi}, TITLE = {Towards sub-word level compositions for sentiment analysis of hindi-english code mixed text}, BOOKTITLE = {International Conference on Computational Linguistics}. YEAR = {2016}}
Sentiment analysis (SA) using code-mixed data from social media has several applications in opinion mining, ranging from customer satisfaction to social campaign analysis in multilingual societies. Advances in this area are impeded by the lack of a suitable annotated dataset. We introduce a Hindi-English (Hi-En) code-mixed dataset for sentiment analysis and perform an empirical analysis comparing the suitability and performance of various state-of-the-art SA methods on social media. In this paper, we introduce learning sub-word level representations in our LSTM (Subword-LSTM) architecture instead of character-level or word-level representations. This linguistic prior in our architecture enables us to learn information about the sentiment value of important morphemes. It also works well on highly noisy text containing misspellings, as shown in our experiments and demonstrated by the morpheme-level feature maps learned by our model. We further hypothesize that encoding this linguistic prior in the Subword-LSTM architecture leads to its superior performance. Our system attains an accuracy 4-5% higher than traditional approaches on our dataset, and also outperforms the available system for sentiment analysis of Hi-En code-mixed text by 18%.
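The sub-word idea above — representing a token by overlapping character n-grams so that misspelled variants still share most of their units — can be sketched as plain feature extraction. This is a minimal illustration under that assumption (function names and the n-gram scheme with boundary markers are ours); the paper itself learns these compositions inside an LSTM rather than counting them.

```python
def subword_ngrams(token, n=3):
    """Character n-grams of a token with boundary markers,
    e.g. 'bahut' -> ['<ba', 'bah', 'ahu', 'hut', 'ut>']."""
    padded = f"<{token}>"
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def sentence_features(sentence, n=3):
    """Bag of sub-word units for a (possibly code-mixed) sentence.
    Misspellings such as 'bahut' vs 'bahuth' still share most n-grams,
    which is why sub-word representations tolerate noisy text."""
    feats = {}
    for tok in sentence.lower().split():
        for ng in subword_ngrams(tok, n):
            feats[ng] = feats.get(ng, 0) + 1
    return feats
```

In the Subword-LSTM, the analogous units are embedded and composed by the network instead of being counted, but the robustness argument is the same.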
IIITH at BioASQ Challenge 2015 Task 3a: Extreme Classification of PubMed Articles using MeSH Labels
AVINASH KAMINENI,NAUSHEEN FATMA,ARPITA DAS,Manish Srivastava,Manoj Chinnakotla
International Conference of the CLEF Association, CLEFS, 2015
@inproceedings{bib_IIIT_2015, AUTHOR = {AVINASH KAMINENI, NAUSHEEN FATMA, ARPITA DAS, Manish Srivastava, Manoj Chinnakotla}, TITLE = {IIITH at BioASQ Challenge 2015 Task 3a: Extreme Classification of PubMed Articles using MeSH Labels}, BOOKTITLE = {International Conference of the CLEF Association}. YEAR = {2015}}
Automating the process of indexing journal abstracts has been a topic of research for several years. Biomedical semantic indexing aims to assign correct MeSH terms to PubMed documents. In this paper, we report our participation in Task 3a of the BioASQ challenge 2015. The participating teams were provided with PubMed articles and asked to return relevant MeSH terms. We tried three different approaches: nearest neighbours, IDF-ratio based indexing and multi-label classification. The official challenge results demonstrate that we consistently performed better than the baseline approaches for Task 3a.
IIITH at BioASQ Challenge 2015 Task 3b: Bio-Medical Question Answering System
HARISH YENALA,AVINASH KAMINENI,Manish Srivastava,Manoj Chinnakotla
International Conference of the CLEF Association, CLEFS, 2015
@inproceedings{bib_IIIT_2015, AUTHOR = {HARISH YENALA, AVINASH KAMINENI, Manish Srivastava, Manoj Chinnakotla}, TITLE = {IIITH at BioASQ Challenge 2015 Task 3b: Bio-Medical Question Answering System}, BOOKTITLE = {International Conference of the CLEF Association}. YEAR = {2015}}
In this paper, we describe our participation in the 2015 BioASQ challenge on Bio-Medical Question Answering. For the Question Answering task (Task 3b), teams were provided with natural language questions and asked to retrieve responses from the PubMed corpus in the form of documents, snippets, concepts and RDF triplets (Phase A) and direct answers (Phase B). For Phase A, we relied on the PubMed search engine and our snippet extraction technique. In our QA system, apart from the standard techniques discussed in the literature, we tried the following novel techniques: a) leveraging web search results to improve question processing and b) identifying domain words and defining a new answer ranking function based on the number of common domain words. We scored an F-measure of 0.193 for document extraction and an F-measure of 0.0717 for snippet generation.
Recognition of Chemical Entity Mention in Patents using CRF, Domain Specific Dictionaries and Features
Venkata Ravindra Nittala,Srinivas Jonnalagadda,Manish Srivastava
BioCreative Challenge Evaluation Workshop, BCCE, 2015
@inproceedings{bib_Reco_2015, AUTHOR = {Venkata Ravindra Nittala, Srinivas Jonnalagadda, Manish Srivastava}, TITLE = {Recognition of Chemical Entity Mention in Patents using CRF, Domain Specific Dictionaries and Features}, BOOKTITLE = {BioCreative Challenge Evaluation Workshop}. YEAR = {2015}}
We present a system employing domain-specific dictionaries and features to recognize chemical entities. The system utilizes sentence segmentation, tokenization, feature generation, Conditional Random Field (CRF) training and one post-processing step. The dictionaries were compiled from PubChem, Wikipedia, ChEMBL, DrugBank, and word2vec clusters from US patents belonging to the A61K class. We report the evaluation results of the run where the development set was not included as part of the training set. The best performing model for the CEMP task has micro-averaged precision, recall and F-score values of 87.26%, 79.98% and 83.46%, respectively.
Developing Part-of-Speech Tagger for a Resource Poor Language :Sindhi
RAVEESH MOTLANI,Harsh Lalwani,Manish Srivastava,Dipti Mishra Sharma
Conference on Language and Technology,, CLT, 2015
@inproceedings{bib_Deve_2015, AUTHOR = {RAVEESH MOTLANI, Harsh Lalwani, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Developing Part-of-Speech Tagger for a Resource Poor Language :Sindhi}, BOOKTITLE = {Conference on Language and Technology,}. YEAR = {2015}}
Sindhi is an Indo-Aryan language spoken by more than 58 million speakers around the world. It is currently a resource-poor language, further hindered by its literature being written in multiple scripts. Though the language is widely spoken, primarily across two countries, the written form is not standardized. In this paper, we seek to develop resources for basic language processing of Sindhi in one of its preferred scripts (Devanagari), because a language that seeks to survive in the modern information society requires language technology products. This paper presents our work on building a stochastic part-of-speech tagger for Sindhi-Devanagari using conditional random fields with linguistically motivated features. The paper also discusses the steps taken to construct a part-of-speech annotated corpus for Sindhi in Devanagari script. We also explain in detail the features used for training the tagger, which resulted in a part-of-speech tagger nearing 92% average accuracy.
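A CRF POS tagger of the kind described above is driven by per-token feature functions. The sketch below shows the typical shape of such linguistically motivated features (the exact feature set used in the paper is not reproduced here, and the placeholder tokens are romanized stand-ins for Devanagari-script Sindhi words).

```python
def token_features(tokens, i):
    """Hand-crafted features for token i in a sentence, in the style
    commonly fed to a linear-chain CRF for POS tagging: surface form,
    affixes, shape cues, and a small context window."""
    tok = tokens[i]
    return {
        "word": tok,
        "prefix2": tok[:2],          # affix features capture morphology
        "suffix3": tok[-3:],
        "length": len(tok),
        "is_digit": tok.isdigit(),
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }
```

A CRF toolkit would consume one such feature dict per token and learn tag-transition and feature weights jointly over the annotated corpus.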
" Answer ka type kya he?" Learning to Classify Questions in Code-Mixed Language
CHANDU KHYATHI RAGHAVI,Manoj Chinnakotla,Manish Srivastava
International Conference on World wide web, WWW, 2015
@inproceedings{bib_"_An_2015, AUTHOR = {CHANDU KHYATHI RAGHAVI, Manoj Chinnakotla, Manish Srivastava}, TITLE = {" Answer ka type kya he?" Learning to Classify Questions in Code-Mixed Language}, BOOKTITLE = {International Conference on World wide web}. YEAR = {2015}}
Code-Mixing (CM) is defined as the embedding of linguistic units such as phrases, words, and morphemes of one language into an utterance of another language. CM is a natural phenomenon observed in many multilingual societies. It helps in speeding up communication and allows a wider variety of expression, due to which it has become a popular mode of communication in social media forums like Facebook and Twitter. However, current Question Answering (QA) research and systems only support expressing a question in a single language, which is an unrealistic and hard proposition especially for certain domains like health and technology. In this paper, we take the first step towards the development of a full-fledged QA system for CM language: building a Question Classification (QC) system. The QC system analyzes the user question and infers the expected Answer Type (AType). The AType helps in locating and verifying the answer, as it imposes certain type-specific constraints. We learn a basic Support Vector Machine (SVM) based QC system for English-Hindi CM questions. Due to the inherent complexities involved in processing CM language, and the unavailability of language processing resources such as POS taggers, chunkers and parsers, we design our current system using only word-level resources such as language identification, transliteration and lexical translation. To reduce data sparsity and leverage resources available in a resource-rich language, instead of extracting features directly from the original CM words, we translate them into a common language, English, and then perform featurization. We created an evaluation dataset for this task, and our system achieves an accuracy of 63% and 45% on the coarse-grained and fine-grained categories of the question taxonomy, respectively. The idea of translating features into English indeed helps in improving accuracy over the unigram baseline.
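The translate-then-featurize step above can be illustrated in miniature: map each code-mixed token into English before building features, so Hindi and English surface forms land in one feature space. The tiny lexicon below is purely illustrative (the paper uses word-level transliteration and lexical translation resources, not a hand-written dictionary).

```python
def translate_tokens(tokens, en_lexicon):
    """Map code-mixed tokens into English before featurization, so that
    'bahut' and 'very' contribute the same feature. Unknown tokens are
    kept as-is (lowercased). Toy stand-in for the paper's translation
    and transliteration resources."""
    return [en_lexicon.get(tok.lower(), tok.lower()) for tok in tokens]

# Hypothetical mini-lexicon for an English-Hindi code-mixed question.
LEXICON = {"kya": "what", "he": "is", "bahut": "very"}
normalized = translate_tokens("Answer ka type kya he".split(), LEXICON)
```

Featurization (e.g. unigrams for the SVM) would then run over `normalized` rather than the raw code-mixed tokens, which is what reduces data sparsity.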
A Language Model Based Approach Towards Large Scale and Light Weight Language Identification Systems
BRIJ MOHAN LAL SRIVASTAVA,VYDANA HARI KRISHNA,Anil Kumar Vuppala,Manish Srivastava
Technical Report, arXiv, 2015
@inproceedings{bib_A_LA_2015, AUTHOR = {BRIJ MOHAN LAL SRIVASTAVA, VYDANA HARI KRISHNA, Anil Kumar Vuppala, Manish Srivastava}, TITLE = {A Language Model Based Approach Towards Large Scale and Light Weight Language Identification Systems}, BOOKTITLE = {Technical Report}. YEAR = {2015}}
Multilingual spoken dialogue systems have gained prominence in the recent past, necessitating a front-end Language Identification (LID) system. Most existing LID systems rely on modeling language-discriminative information from low-level acoustic features. Due to the variabilities of speech (speaker and emotional variabilities, etc.), large-scale LID systems developed using low-level acoustic features suffer from a degradation in performance. In this approach, we attempt to model higher-level language-discriminative phonotactic information for developing an LID system. The input speech signal is tokenized into phone sequences using a language-independent phone recognizer, and the language-discriminative phonotactic information in the obtained phone sequences is modeled using statistical and recurrent neural network based language modeling approaches. As this approach relies on higher-level phonotactic information, it is more robust to the variabilities of speech. The proposed approach is computationally lightweight and highly scalable, and it can be used to complement existing LID systems.
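The statistical side of the pipeline above — score a phone sequence under a per-language phonotactic language model and pick the best-scoring language — can be sketched with add-one-smoothed phone bigrams. This is a toy illustration of the phonotactic-LM idea (function names, smoothing choice, and the synthetic phone inventories are assumptions), not the paper's recognizer or RNN models.

```python
import math
from collections import Counter

def train_phone_bigram_lm(phone_sequences):
    """Add-one-smoothed bigram LM over phone tokens; returns P(b | a)."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for seq in phone_sequences:
        seq = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(seq)
        unigrams.update(seq[:-1])
        bigrams.update(zip(seq, seq[1:]))
    V = len(vocab)
    return lambda a, b: (bigrams[(a, b)] + 1) / (unigrams[a] + V)

def score(lm, phones):
    """Log-probability of a phone sequence under one language's LM."""
    seq = ["<s>"] + list(phones) + ["</s>"]
    return sum(math.log(lm(a, b)) for a, b in zip(seq, seq[1:]))

def identify(lms, phones):
    """LID decision: the language whose phonotactic LM scores highest."""
    return max(lms, key=lambda lang: score(lms[lang], phones))
```

A single language-independent phone recognizer feeds all the per-language LMs, which is what keeps the approach lightweight and easy to extend to new languages.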
High Speed Quantile Based Histogram Equalization for Brightness Preservation and Contrast Enhancement
Mayank Tiwari,Bhupendra Gupta,Manish Srivastava
IET Image Processing, IET-IP, 2014
@inproceedings{bib_High_2014, AUTHOR = {Mayank Tiwari, Bhupendra Gupta, Manish Srivastava}, TITLE = {High Speed Quantile Based Histogram Equalization for Brightness Preservation and Contrast Enhancement}, BOOKTITLE = {IET Image Processing}. YEAR = {2014}}
In this paper, we introduce a new histogram equalization based contrast enhancement method called High Speed Quantile Based Histogram Equalization (HSQHE), suitable for high-contrast digital images. The proposed method is an effective tool for dealing with the "mean-shift" problem, a common issue with histogram equalization based contrast enhancement methods. The main idea of HSQHE is to divide the input image histogram into two or more sub-histograms, where the segmentation is based on quantile values. Since the histogram segmentation is based on quantile values, the entire spectrum of grey levels always plays a role in the enhancement process. Moreover, the proposed method does not require the recursive segmentation of the histogram used by many other methods, and hence requires less time for segmentation. The experimental results show that HSQHE performs better than other existing methods in the literature, preserves image brightness more accurately than the prevailing state of the art, and takes less time than the other methods.
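The core idea — split the grey-level histogram at a quantile and equalize each sub-histogram within its own range, so the output mean stays near the input mean — can be sketched with NumPy. This is a simplified single-split version for illustration (the paper allows two or more segments, and its exact mapping may differ).

```python
import numpy as np

def hsqhe(img, q=0.5):
    """Quantile-based bi-histogram equalization sketch: equalize pixels at
    or below the q-quantile into [0, split] and the rest into
    [split+1, 255], limiting the mean shift of plain equalization."""
    img = np.asarray(img, dtype=np.uint8)
    split = int(np.quantile(img, q))

    def build_lut(values, lo, hi):
        # CDF of the sub-histogram, stretched over the segment's own range.
        hist = np.bincount(values, minlength=256).astype(np.float64)
        cdf = hist.cumsum()
        if cdf[-1] > 0:
            cdf /= cdf[-1]
        return (lo + cdf * (hi - lo)).astype(np.uint8)

    low_mask = img <= split
    out = np.empty_like(img)
    lut_lo = build_lut(img[low_mask].ravel(), 0, split)
    out[low_mask] = lut_lo[img[low_mask]]
    if (~low_mask).any():
        lut_hi = build_lut(img[~low_mask].ravel(), split + 1, 255)
        out[~low_mask] = lut_hi[img[~low_mask]]
    return out
```

Because each segment is remapped only within its own grey-level range, dark pixels stay dark and bright pixels stay bright, which is precisely the brightness-preservation property claimed for HSQHE.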
IIIT-H System Submission for FIRE2014 Shared Task on Transliterated Search
IRSHAD AHMAD BHAT,Vandan Mujadia,TAMMEWAR ANIRUDDHA UTTAM,RIYAZ AHMAD BHAT,Manish Srivastava
Forum for Information Retrieval Evaluation, FIRE, 2014
@inproceedings{bib_IIIT_2014, AUTHOR = {IRSHAD AHMAD BHAT, Vandan Mujadia, TAMMEWAR ANIRUDDHA UTTAM, RIYAZ AHMAD BHAT, Manish Srivastava}, TITLE = {IIIT-H System Submission for FIRE2014 Shared Task on Transliterated Search}, BOOKTITLE = {Forum for Information Retrieval Evaluation}. YEAR = {2014}}
This paper describes our submission for the FIRE 2014 Shared Task on Transliterated Search. The shared task features two sub-tasks: Query Word Labeling and Mixed-script Ad hoc Retrieval for Hindi Song Lyrics. Query Word Labeling concerns token-level language identification of query words in code-mixed queries and back-transliteration of the identified Indian-language words into their native scripts. We developed letter-based language models for token-level language identification of query words and a structured perceptron model for back-transliteration of Indic words. The second sub-task, Mixed-script Ad hoc Retrieval for Hindi Song Lyrics, is to retrieve a ranked list of songs from a corpus of Hindi song lyrics given an input query in Devanagari or transliterated Roman script. We used edit-distance-based query expansion and language modeling, followed by relevance-based re-ranking, to retrieve relevant Hindi song lyrics for a given query.
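The edit-distance-based query expansion mentioned above can be sketched as follows: expand a (possibly noisily transliterated) query term with close spelling variants drawn from the lyrics vocabulary. The expansion function and threshold are illustrative assumptions; the paper's exact procedure may differ.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def expand_query(term, lexicon, max_dist=1):
    """Expand a query term with spelling variants from the corpus
    vocabulary, tolerating transliteration noise like 'pyar' vs 'pyaar'."""
    return [w for w in lexicon if edit_distance(term, w) <= max_dist]
```

Retrieval then runs over the expanded term set, so a Roman-script query still matches lyrics indexed under a slightly different transliteration.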