Progressive Perturbation with KTO for Enhanced Machine Translation of Indian Languages
Yash Bhaskar, Ketaki Mangesh Shetye, Vandan Mujadia, Dipti Mishra Sharma, Parameswari Krishnamurthy
@inproceedings{bib_Prog_2025, AUTHOR = {Yash Bhaskar, Ketaki Mangesh Shetye, Vandan Mujadia, Dipti Mishra Sharma, Parameswari Krishnamurthy}, TITLE = {Progressive Perturbation with KTO for Enhanced Machine Translation of Indian Languages}, BOOKTITLE = {MT Summit}, YEAR = {2025}}
This study addresses the critical challenge of data scarcity in machine translation for Indian languages, particularly given their morphological complexity and limited parallel data. We investigate an effective strategy to maximize the utility of existing data by generating negative samples from positive training instances using a progressive perturbation approach. These negative samples are used to align the model with preference data using Kahneman-Tversky Optimization (KTO). Comparing it against traditional Supervised Fine-Tuning (SFT), we demonstrate how generating negative samples and leveraging KTO enhances data efficiency. By creating rejected samples through progressively perturbed translations from the available dataset, we fine-tune the Llama 3.1 Instruct 8B model using QLoRA across 16 language directions, including English, Hindi, Bangla, Tamil, Telugu, and Santali. Our results show that KTO-based preference alignment with progressive perturbation consistently outperforms SFT, achieving significant gains in translation quality with an average BLEU increase of 1.84 to 2.47 and CHRF increase of 2.85 to 4.01 compared to SFT for selected languages, while using the same positive training samples and under similar computational constraints. This highlights the potential of our negative sample generation strategy within KTO, especially in low-resource scenarios.
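The negative-sample generation described in this abstract can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the function names, the edit operations (token drops and adjacent swaps), and the perturbation levels are all assumptions.

```python
import random


def perturb(tokens, n_edits, rng):
    """Apply n_edits random token drops or adjacent swaps (hypothetical edit set)."""
    out = list(tokens)
    for _ in range(n_edits):
        if len(out) < 2:
            break
        i = rng.randrange(len(out) - 1)
        if rng.choice(["drop", "swap"]) == "drop":
            del out[i]
        else:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


def make_preference_pairs(source, reference, levels=(1, 2, 4), seed=0):
    """Build KTO-style examples: the reference translation is the 'chosen'
    sample; each progressively heavier perturbation of it becomes a
    'rejected' sample for the same source sentence."""
    rng = random.Random(seed)
    tokens = reference.split()
    examples = [{"prompt": source, "completion": reference, "label": True}]
    for n in levels:
        examples.append({
            "prompt": source,
            "completion": " ".join(perturb(tokens, n, rng)),
            "label": False,
        })
    return examples
```

Each increase in the edit budget yields a more heavily corrupted translation, giving the preference-alignment step a graded set of rejected outputs from a single positive instance.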
Towards Large Language Model driven Reference-less Translation Evaluation for English and Indian Languages
Vandan Mujadia, PRUTHWIK MISHRA, Arafat Ahsan, Dipti Mishra Sharma
@inproceedings{bib_Towa_2024, AUTHOR = {Vandan Mujadia, PRUTHWIK MISHRA, Arafat Ahsan, Dipti Mishra Sharma}, TITLE = {Towards Large Language Model driven Reference-less Translation Evaluation for English and Indian Languages}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2024}}
With the primary focus on evaluating the effectiveness of large language models for automatic reference-less translation assessment, this work presents our experiments on mimicking human direct assessment to evaluate the quality of translations in English and Indian languages. We constructed a translation evaluation task where we performed zero-shot learning, in-context example-driven learning, and fine-tuning of large language models to provide a score out of 100, where 100 represents a perfect translation and 1 represents a poor translation. We compared the performance of our trained systems with existing methods such as COMET, BERT-Scorer, and LABSE, and found that the LLM-based evaluator (LLaMA-2-13B) achieves a comparable or higher overall correlation with human judgments for the considered Indian language pairs (refer Figure 1).
Overview of MTIL Track at FIRE 2023: Machine Translation for Indian Languages
Surupendu Gangopadhyay, Prasenjit Majumder, Baban Gain, Ramakrishna Appicharla, Asif Ekbal, Arafat Ahsan, Dipti Mishra Sharma
@inproceedings{bib_Over_2024, AUTHOR = {Surupendu Gangopadhyay, Prasenjit Majumder, Baban Gain, Ramakrishna Appicharla, Asif Ekbal, Arafat Ahsan, Dipti Mishra Sharma}, TITLE = {Overview of MTIL Track at FIRE 2023: Machine Translation for Indian Languages}, BOOKTITLE = {Forum for Information Retrieval Evaluation}, YEAR = {2024}}
The objective of the MTIL track in FIRE 2023 was to encourage the development of Indian Language to Indian Language (IL-IL) Neural Machine Translation models. The languages covered in the track included Hindi, Gujarati, Kannada, Odia, Punjabi, Urdu, Telugu, Kashmiri, and Sindhi. The track consists of two tasks: (i) a General Translation Task and (ii) a Domain-specific Translation Task, with Governance and Healthcare as the chosen domains. For the listed languages, we proposed 12 diverse language directions for the general domain translation task and 8 each for the healthcare and governance domains. Participants were encouraged to submit models for one or more language pairs. We witnessed the creation of 34 distinct models spanning various language pairs and domains. Model assessments were conducted using five evaluation metrics: BLEU, CHRF, CHRF++, TER, and COMET. The submitted model outputs were ranked based on the CHRF score.
Estimating the Quality of Translated Medical Texts using Back Translation & Resource Description Framework
BINAY KUMAR NEEKHRA, Dipti Mishra Sharma
@inproceedings{bib_Esti_2024, AUTHOR = {BINAY KUMAR NEEKHRA, Dipti Mishra Sharma}, TITLE = {Estimating the Quality of Translated Medical Texts using Back Translation & Resource Description Framework}, BOOKTITLE = {Semantic Web Solutions for Large-scale Biomedical Data Analytics}, YEAR = {2024}}
How can we effectively estimate the quality of translated texts in the medical field, where back-translation is usually available and/or recommended for sensitive documents? This paper proposes a novel metric, GATE, for the translation quality estimation task, leveraging the Resource Description Framework (RDF) to encode both semantic and syntactic information of the original and back-translated sentences into RDF graphs. The distance between these graphs is measured to obtain a semantic similarity score that assesses the quality of the translation. Unlike traditional metrics like BLEU and METEOR, our approach is reference-less, capturing both semantic and syntactic information for a comprehensive assessment of translation quality. Our results correlate better with human judgment, giving a better Pearson correlation (0.357) as compared to BLEU (0.200), thereby showing ~70% improvement over BLEU. Our research shows that, in the field of translation evaluation, existing resources like back-translation and RDF could be useful.
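The back-translation scoring idea above can be illustrated with a much-simplified sketch. The paper encodes sentences as RDF graphs and measures graph distance; here, as an assumed stand-in, adjacent-word "edges" approximate a graph and Jaccard overlap approximates graph similarity. All names and the edge representation are hypothetical, not the paper's GATE implementation.

```python
def edge_set(sentence):
    """Adjacent-word 'edges' as a crude stand-in for the paper's RDF triples."""
    words = sentence.lower().split()
    return {(a, b) for a, b in zip(words, words[1:])}


def backtranslation_score(original, back_translated):
    """Reference-less QE sketch: similarity between the original sentence and
    its back-translation, computed as Jaccard overlap of adjacency edges.
    A score near 1.0 suggests the round trip preserved the content."""
    e1, e2 = edge_set(original), edge_set(back_translated)
    if not e1 and not e2:
        return 1.0
    return len(e1 & e2) / len(e1 | e2)
```

For example, an identical back-translation scores 1.0, while a completely divergent one scores 0.0; a real RDF-based metric would compare subject-predicate-object triples rather than surface adjacency.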
Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages
Sankalp Sanjay Bahad, Pruthwik Mishra, Karunesh Arora, Rakesh Chandra Balabantaray, Dipti Mishra Sharma, Parameswari Krishnamurthy
@inproceedings{bib_Fine_2024, AUTHOR = {Sankalp Sanjay Bahad, Pruthwik Mishra, Karunesh Arora, Rakesh Chandra Balabantaray, Dipti Mishra Sharma, Parameswari Krishnamurthy}, TITLE = {Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages}, BOOKTITLE = {NAACL Student Research Workshop}, YEAR = {2024}}
Named Entity Recognition (NER) is a useful component in Natural Language Processing (NLP) applications. It is used in various tasks such as Machine Translation, Summarization, Information Retrieval, and Question-Answering systems. The research on NER is centered around English and some other major languages, whereas limited attention has been given to Indian languages. We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian Languages. We present a human-annotated named entity corpus of ∼40K sentences for 4 Indian languages from two of the major Indian language families. Additionally, we present a multilingual model fine-tuned on our dataset, which achieves an F1 score of ∼0.80 on our dataset on average. We achieve comparable performance on completely unseen benchmark datasets for Indian languages, which affirms the usability of our model.
Towards Disfluency Annotated Corpora for Indian Languages
Chayan Kochar, Vandan Mujadia, PRUTHWIK MISHRA, Dipti Mishra Sharma
@inproceedings{bib_Towa_2024, AUTHOR = {Chayan Kochar, Vandan Mujadia, PRUTHWIK MISHRA, Dipti Mishra Sharma}, TITLE = {Towards Disfluency Annotated Corpora for Indian Languages}, BOOKTITLE = {International Conference Computational Linguistics Workshops}, YEAR = {2024}}
In the natural course of spoken language, individuals often engage in thinking and self-correction during speech production. These instances of interruption or correction are commonly referred to as disfluencies. When preparing data for subsequent downstream NLP tasks, these linguistic elements can be systematically removed, or handled as required, to enhance data quality. In this study, we present a comprehensive research on disfluencies in Indian languages. Our approach involves not only annotating real-world conversation transcripts but also conducting a detailed analysis of linguistic nuances inherent to Indian languages that are necessary to consider during annotation. Additionally, we introduce a robust algorithm for the synthetic generation of disfluent data. This algorithm aims to facilitate more effective model training for the identification of disfluencies in real-world conversations, thereby contributing to the advancement of disfluency research in Indian languages.
Assessing Translation capabilities of Large Language Models involving English and Indian Languages
Vandan Mujadia, Urlana Ashok, Yash Bhaskar, Penumalla Aditya Pavani, Kukkapalli Shravya, Parameswari Krishnamurthy, Dipti Mishra Sharma
@inproceedings{bib_Asse_2023, AUTHOR = {Vandan Mujadia, Urlana Ashok, Yash Bhaskar, Penumalla Aditya Pavani, Kukkapalli Shravya, Parameswari Krishnamurthy, Dipti Mishra Sharma}, TITLE = {Assessing Translation capabilities of Large Language Models involving English and Indian Languages}, BOOKTITLE = {Technical Report}, YEAR = {2023}}
Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. In this work, our aim is to explore the multilingual capabilities of large language models by using machine translation as a task involving English and 22 Indian languages. We first investigate the translation capabilities of raw large language models, followed by exploring the in-context learning capabilities of the same raw models. We fine-tune these large language models using parameter-efficient fine-tuning methods such as LoRA and additionally with full fine-tuning. Through our study, we have identified the best performing large language model for the translation task involving LLMs, which is based on LLaMA.
Processing English Verb Phrase Ellipsis for Conversational English-Hindi Machine Translation
Aniruddha Prashant Deshpande, Dipti Mishra Sharma
@inproceedings{bib_Proc_2023, AUTHOR = {Aniruddha Prashant Deshpande, Dipti Mishra Sharma}, TITLE = {Processing English Verb Phrase Ellipsis for Conversational English-Hindi Machine Translation}, BOOKTITLE = {International Conference on Human-Informed Translation and Interpreting Technology}, YEAR = {2023}}
In this paper, we try to tackle the problem of erroneous English-Hindi machine translation (MT) outputs due to the presence of the Verb Phrase Ellipsis (VPE) in English. The phenomenon of VPE is prominent in spoken English, and the antecedent to the ellipsis can come from previous sentences in a conversation as well. MT systems translate sentences as a whole and ignore the contextual information from the previous sentences. For these two reasons, spoken English-Hindi translations suffer. We approached this problem by manually annotating 1200 two-person conversations that contain VPE and by studying how their resolution affects the translation qualities. Using our studies, we designed a rule-based system for the detection and resolution of VPE in English with the goal of improving their subsequent Hindi translation qualities. Our rule-based system is capable of the following: 1) detection of VPE, 2) resolution of the elided head verb, 3) resolution of the elided head verb's children, 4) resolution of non-verbal predicates of a copula or a 'be' main verb, 5) modifying the original sentence in the conversation with the resolved verb phrase. We also tested the system's performance on VPE datasets outside of our annotated data. In this paper, we present our annotated corpus on conversational English VPE, our rule-based system to tackle VPE in the context of improving English-Hindi MT, the observations made as we designed this rule-based system, and the performance-related observations of our system.
Vandan Mujadia, S. Umesh, Hema A. Murthy, Rajeev Sangal, Dipti Mishra Sharma
@inproceedings{bib_Towa_2023, AUTHOR = {Vandan Mujadia, S. Umesh, Hema A. Murthy, Rajeev Sangal, Dipti Mishra Sharma}, TITLE = {Towards Speech to Speech Machine Translation focusing on Indian Languages}, BOOKTITLE = {European Chapter of the Association for Computational Linguistics System Demonstrations}, YEAR = {2023}}
We introduce an SSMT (Speech to Speech Machine Translation, aka Speech to Speech Video Translation) pipeline, as a web application for translating videos from one language to another by cascading multiple language modules. Our speech translation system combines highly accurate speech to text (ASR) for Indian English, pre-processing modules to bridge ASR-MT gaps such as spoken disfluency and punctuation, robust machine translation (MT) systems for multiple language pairs, an SRT module for translated text, a text to speech (TTS) module, and a module to render translated synthesized audio on the original video. It is a user-friendly, flexible, and easily accessible system. We aim to provide a complete configurable speech translation experience to users and researchers with this system. It also supports human intervention, where users can edit outputs of different modules and the edited output can then be used for subsequent processing to improve overall output quality. By adopting a human-in-the-loop approach, the aim is to configure technology in such a way that it can assist humans and help to reduce the involved human efforts in speech translation involving English and Indian languages. As per our understanding, this is the first fully integrated system for English to Indian languages (Hindi, Telugu, Gujarati, Marathi, and Punjabi) video translation. Our evaluation shows that one can get a 3.5+ MOS score using the developed pipeline with human intervention for English to Hindi. A short video demonstrating our system is available at https://youtu.be/MVftzoeRg48.
Testing a computational model of causative overgeneralizations: Child judgment and production data from English, Hebrew, Hindi, Japanese and K’iche’
Ben Ambridge, Bhuvana Narasimhan, Dipti Mishra Sharma, Laura Doherty, Ramya Maitreyee, Tomoko Tatsumi, Shira Zicherman, Pedro Mateo Pedro, Ayuno Kawakami, Amy Bidgood, Clifton Pye
Open Research Europe, OREU, 2022
@inproceedings{bib_Test_2022, AUTHOR = {Ben Ambridge, Bhuvana Narasimhan, Dipti Mishra Sharma, Laura Doherty, Ramya Maitreyee, Tomoko Tatsumi, Shira Zicherman, Pedro Mateo Pedro, Ayuno Kawakami, Amy Bidgood, Clifton Pye}, TITLE = {Testing a computational model of causative overgeneralizations: Child judgment and production data from English, Hebrew, Hindi, Japanese and K'iche'}, BOOKTITLE = {Open Research Europe}, YEAR = {2022}}
How do language learners avoid the production of verb argument structure overgeneralization errors (*The clown laughed the man c.f. The clown made the man laugh), while retaining the ability to apply such generalizations productively when appropriate? This question has long been seen as one that is both particularly central to acquisition research and particularly challenging. Focussing on causative overgeneralization errors of this type, a previous study reported a computational model that learns, on the basis of corpus data and human-derived verb-semantic-feature ratings, to pred
The LTRC Hindi-Telugu Parallel Corpus.
Vandan Mujadia, Dipti Mishra Sharma
International Conference on Language Resources and Evaluation, LREC, 2022
@inproceedings{bib_The__2022, AUTHOR = {Vandan Mujadia, Dipti Mishra Sharma}, TITLE = {The LTRC Hindi-Telugu Parallel Corpus}, BOOKTITLE = {International Conference on Language Resources and Evaluation}, YEAR = {2022}}
We present the Hindi-Telugu Parallel Corpus of different technical domains such as Natural Science, Computer Science, Law and Healthcare along with the General domain. The qualitative corpus consists of 700K parallel sentences, of which 535K sentences were created using multiple methods such as extract, align and review of Hindi-Telugu corpora, end-to-end human translation, and iterative back-translation driven post-editing, and around 165K parallel sentences were collected from available sources in the public domain. We present the comparative assessment of the created parallel corpora for representativeness and diversity. The corpus has been pre-processed for machine translation, and we trained a neural machine translation system using it and report state-of-the-art baseline results on the developed development set over multiple domains and on available benchmarks. With this, we define a new task on Domain Machine Translation for low resource language pairs such as Hindi and Telugu. The developed corpus (535K) is freely available for non-commercial research and, to the best of our knowledge, this is the largest well-curated, publicly available domain parallel corpus for Hindi-Telugu.
HAWP: a Dataset for Hindi Arithmetic Word Problem Solving.
Harshita Sharma, PRUTHWIK MISHRA, Dipti Mishra Sharma
International Conference on Language Resources and Evaluation, LREC, 2022
@inproceedings{bib_HAWP_2022, AUTHOR = {Harshita Sharma, PRUTHWIK MISHRA, Dipti Mishra Sharma}, TITLE = {HAWP: a Dataset for Hindi Arithmetic Word Problem Solving}, BOOKTITLE = {International Conference on Language Resources and Evaluation}, YEAR = {2022}}
Word Problem Solving remains a challenging and interesting task in NLP. A lot of research has been carried out to solve different genres of word problems with various complexity levels in recent years. However, most of the publicly available datasets and work have been for English. Recently there has been a surge in this area of word problem solving in Chinese with the creation of large benchmark datasets. Apart from these two languages, labeled benchmark datasets for low resource languages are very scarce. This is the first attempt to address this issue for any Indian language, especially Hindi. In this paper, we present HAWP (Hindi Arithmetic Word Problems), a dataset consisting of 2336 arithmetic word problems in Hindi. We also developed baseline systems for solving these word problems. We also propose a new evaluation technique for word problem solvers taking equation equivalence into account.
Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages
Anusha P, Arun Kumar, Ashish Seth, Bhagyashree M, Ishika Gupta, Jom Kuriakose, Jordan Fernandes, Dipti Mishra Sharma, Rajeev Sangal
Technical Report, arXiv, 2022
@inproceedings{bib_Tech_2022, AUTHOR = {Anusha P, Arun Kumar, Ashish Seth, Bhagyashree M, Ishika Gupta, Jom Kuriakose, Jordan Fernandes, Dipti Mishra Sharma, Rajeev Sangal}, TITLE = {Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages}, BOOKTITLE = {Technical Report}, YEAR = {2022}}
Cross-lingual dubbing of lecture videos requires the transcription of the original audio, correction and removal of disfluencies, domain term discovery, text-to-text translation into the target language, chunking of text using target language rhythm, text-to-speech synthesis followed by isochronous lipsyncing to the original video. This task becomes challenging when the source and target languages belong to different language families, resulting in differences in generated audio duration. This is further compounded by the original speaker’s rhythm, especially for extempore speech. This paper describes the challenges in regenerating English lecture videos in Indian languages semi-automatically. A prototype is developed for dubbing lectures into 9 Indian languages. A mean-opinion-score (MOS) is obtained for two languages, Hindi and Tamil, on two different courses. The output video is compared with the original video in terms of MOS (1-5) and lip synchronisation with scores of 4.09 and 3.74, respectively. The human effort also reduces by 75%.
Gui at MixMT 2022: English-Hinglish: An MT approach for translation of code mixed data
Akshat Gahoi, Saransh Rajput, Jayant Duneja, Tanvi Kamble, Anshul Padhi, Dipti Mishra Sharma, Shivam Sadashiv Mangale, Vasudeva Varma Kalidindi
Technical Report, arXiv, 2022
@inproceedings{bib_Gui__2022, AUTHOR = {Akshat Gahoi, Saransh Rajput, Jayant Duneja, Tanvi Kamble, Anshul Padhi, Dipti Mishra Sharma, Shivam Sadashiv Mangale, Vasudeva Varma Kalidindi}, TITLE = {Gui at MixMT 2022: English-Hinglish: An MT approach for translation of code mixed data}, BOOKTITLE = {Technical Report}, YEAR = {2022}}
Code-mixed machine translation has become an important task in multilingual communities and extending the task of machine translation to code mixed data has become a common task for these languages. In the shared tasks of WMT 2022, we try to tackle the same for both English + Hindi to Hinglish and Hinglish to English. The first task dealt with both Roman and Devanagari script as we had monolingual data in both English and Hindi whereas the second task only had data in Roman script. To our knowledge, we achieved one of the top ROUGE-L and WER scores for the first task of Monolingual to Code-Mixed machine translation. In this paper, we discuss the use of mBART with some special pre-processing and post-processing (transliteration from Devanagari to Roman) for the first task in detail and the experiments that we performed for the second task of translating code-mixed Hinglish to monolingual English.
Building Odia Shallow Parser
PRUTHWIK MISHRA, Dipti Mishra Sharma
Technical Report, arXiv, 2022
@inproceedings{bib_Buil_2022, AUTHOR = {PRUTHWIK MISHRA, Dipti Mishra Sharma}, TITLE = {Building Odia Shallow Parser}, BOOKTITLE = {Technical Report}, YEAR = {2022}}
Shallow parsing is an essential task for many NLP applications like machine translation, summarization, sentiment analysis, aspect identification and many more. Quality annotated corpora are critical for building accurate shallow parsers. Many Indian languages are resource poor with respect to the availability of corpora in general. So, this paper is an attempt towards creating quality corpora for shallow parsers. The contribution of this paper is two-fold: the creation of POS- and chunk-annotated corpora for Odia, and the development of baseline systems for POS tagging and chunking in Odia.
Domain Adaptation for Hindi-Telugu Machine Translation using Domain Specific Back Translation
Hema Ala, Vandan Mujadia, Dipti Mishra Sharma
Recent Advances in Natural Language Processing, RANLP, 2021
@inproceedings{bib_Doma_2021, AUTHOR = {Hema Ala, Vandan Mujadia, Dipti Mishra Sharma}, TITLE = {Domain Adaptation for Hindi-Telugu Machine Translation using Domain Specific Back Translation}, BOOKTITLE = {Recent Advances in Natural Language Processing}, YEAR = {2021}}
In this paper, we present a novel approach for domain adaptation in Neural Machine Translation which aims to improve the translation quality over a new domain. Adapting new domains is a highly challenging task for Neural Machine Translation on limited data; it becomes even more difficult for technical domains such as Chemistry and Artificial Intelligence due to specific terminology, etc. We propose a Domain Specific Back Translation method which uses available monolingual data and generates synthetic data in a different way. This approach uses Out Of Domain words. The approach is very generic and can be applied to any language pair for any domain. We conduct our experiments on Chemistry and Artificial Intelligence domains for Hindi and Telugu in both directions. It has been observed that the usage of synthetic data created by the proposed algorithm improves the BLEU scores significantly.
Deep Contextual Punctuator for NLG Text
Vandan Mujadia, PRUTHWIK MISHRA, Dipti Mishra Sharma
Sentence End and Punctuation Prediction in NLG Text, SEPP- NLG, 2021
@inproceedings{bib_Deep_2021, AUTHOR = {Vandan Mujadia, PRUTHWIK MISHRA, Dipti Mishra Sharma}, TITLE = {Deep Contextual Punctuator for NLG Text}, BOOKTITLE = {Sentence End and Punctuation Prediction in NLG Text}, YEAR = {2021}}
This paper describes our team oneNLP's (LTRC, IIIT-Hyderabad) participation in the SEPP-NLG 2021 shared tasks, Sentence End and Punctuation Prediction in NLG Text. We applied sequence-to-tag prediction over contextual embeddings as fine-tuning for both of these tasks. We also explored the use of multilingual BERT and multitask learning for these tasks on English, German, French and Italian.
English-Marathi Neural Machine Translation for LoResMT 2021
Vandan Mujadia, Dipti Mishra Sharma
Workshop on Technologies for MT of Low Resource Languages, LoResMT, 2021
@inproceedings{bib_Engl_2021, AUTHOR = {Vandan Mujadia, Dipti Mishra Sharma}, TITLE = {English-Marathi Neural Machine Translation for LoResMT 2021}, BOOKTITLE = {Workshop on Technologies for MT of Low Resource Languages}, YEAR = {2021}}
In this paper, we (team oneNLP-IIITH) describe our Neural Machine Translation approaches for English-Marathi (both directions) for LoResMT-2021. We experimented with transformer-based Neural Machine Translation and explored the use of different linguistic features like POS and Morph on subword units for both English-Marathi and Marathi-English. In addition, we have also explored forward and backward translation using web-crawled monolingual data. We obtained BLEU scores of 22.2 (overall 2nd) and 31.3 (overall 1st) for English-Marathi and Marathi-English, respectively.
A Transformer Based Approach towards Identification of Discourse Unit Segments and Connectives
BAKSHI SAHIL, Dipti Mishra Sharma
Discourse Relation Parsing and Treebanking, DISRPT, 2021
@inproceedings{bib_A_Tr_2021, AUTHOR = {BAKSHI SAHIL, Dipti Mishra Sharma}, TITLE = {A Transformer Based Approach towards Identification of Discourse Unit Segments and Connectives}, BOOKTITLE = {Discourse Relation Parsing and Treebanking}, YEAR = {2021}}
Discourse parsing, which involves understanding the structure, information flow, and modeling the coherence of a given text, is an important task in natural language processing. It forms the basis of several natural language processing tasks such as question-answering, text summarization, and sentiment analysis. Discourse unit segmentation is one of the fundamental tasks in discourse parsing and refers to identifying the elementary units of text that combine to form a coherent text. In this paper, we present a transformer based approach towards the automated identification of discourse unit segments and connectives. Early approaches towards segmentation relied on rule-based systems using POS tags and other syntactic information to identify discourse segments. Recently, transformer based neural systems have shown promising results in this domain. Our system, SegFormers, employs this transformer based approach to perform multilingual discourse segmentation and connective identification across 16 datasets encompassing 11 languages and 3 different annotation frameworks. We evaluate the system based on F1 scores for both tasks, with the best system reporting the highest F1 score of 97.02% for the treebanked English RST-DT dataset.
Stress Rules from Surface Forms: Experiments with Program Synthesis
Saujas Srinivasa Vaduguru, Partho Sarthi, Monojit Choudhury, Dipti Mishra Sharma
International Conference on Natural Language Processing, ICON, 2021
@inproceedings{bib_Stre_2021, AUTHOR = {Saujas Srinivasa Vaduguru, Partho Sarthi, Monojit Choudhury, Dipti Mishra Sharma}, TITLE = {Stress Rules from Surface Forms: Experiments with Program Synthesis}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2021}}
Learning linguistic generalizations from only a few examples is a challenging task. Recent work has shown that program synthesis – a method to learn rules from data in the form of programs in a domain-specific language – can be used to learn phonological rules in highly data-constrained settings. In this paper, we use the problem of phonological stress placement as a case to study how the design of the domain-specific language influences the generalization ability when using the same learning algorithm. We find that encoding the distinction between consonants and vowels results in much better performance, and providing syllable-level information further improves generalization. Program synthesis, thus, provides a way to investigate how access to explicit linguistic information influences what can be learnt from a small number of examples.
Low Resource Similar Language Neural Machine Translation for Tamil-Telugu
Vandan Mujadia, Dipti Mishra Sharma
Conference on Machine Translation, WMT, 2021
@inproceedings{bib_Low__2021, AUTHOR = {Vandan Mujadia, Dipti Mishra Sharma}, TITLE = {Low Resource Similar Language Neural Machine Translation for Tamil-Telugu}, BOOKTITLE = {Conference on Machine Translation}, YEAR = {2021}}
This paper describes the participation of team oneNLP (LTRC, IIIT-Hyderabad) in the WMT 2021 similar language translation task. We experimented with transformer-based Neural Machine Translation and explored the use of language similarity for Tamil-Telugu and Telugu-Tamil. We incorporated the use of different subword configurations, script conversion, and single model training for both directions as exploratory experiments.
Sample-efficient Linguistic Generalizations through Program Synthesis: Experiments with Phonology Problems
Saujas Srinivasa Vaduguru, Aalok Sathe, Monojit Choudhury, Dipti Mishra Sharma
SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, SIGMORPHON, 2021
@inproceedings{bib_Samp_2021, AUTHOR = {Saujas Srinivasa Vaduguru, Aalok Sathe, Monojit Choudhury, Dipti Mishra Sharma}, TITLE = {Sample-efficient Linguistic Generalizations through Program Synthesis: Experiments with Phonology Problems}, BOOKTITLE = {SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology}. YEAR = {2021}}
Neural models excel at extracting statistical patterns from large amounts of data, but struggle to learn patterns or reason about language from only a few examples. In this paper, we ask: Can we learn explicit rules that generalize well from only a few examples? We explore this question using program synthesis. We develop a synthesis model to learn phonology rules as programs in a domain-specific language. We test the ability of our models to generalize from few training examples using our new dataset of problems from the Linguistics Olympiad, a challenging set of tasks that require strong linguistic reasoning ability. In addition to being highly sample-efficient, our approach generates human-readable programs, and allows control over the generalizability of the learnt programs.
Assessing Post-editing Effort in the English-Hindi Direction
Arafat Ahsan,Vandan Mujadia,Dipti Mishra Sharma
International Conference on Natural Language Processing., ICON, 2021
@inproceedings{bib_Asse_2021, AUTHOR = {Arafat Ahsan, Vandan Mujadia, Dipti Mishra Sharma}, TITLE = {Assessing Post-editing Effort in the English-Hindi Direction}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2021}}
We present findings from a first in-depth post-editing effort estimation study in the English-Hindi direction along multiple effort indicators. We conduct a controlled experiment involving professional translators, who complete assigned tasks alternately, in a translation from scratch and a post-edit condition. We find that post-editing reduces translation time (by 63%), utilizes fewer keystrokes (by 59%), and decreases the number of pauses (by 63%) when compared to translating from scratch. We further verify the quality of translations thus produced via a human evaluation task in which we do not detect any discernible quality differences.
IIIT Hyderabad Submission To WAT 2021: Efficient Multilingual NMT systems for Indian languages
Sourav Kumar,Salil Aggarwal,Dipti Mishra Sharma
Workshop on Asian Translation, WAT, 2021
@inproceedings{bib_IIIT_2021, AUTHOR = {Sourav Kumar, Salil Aggarwal, Dipti Mishra Sharma}, TITLE = {IIIT Hyderabad Submission To WAT 2021: Efficient Multilingual NMT systems for Indian languages}, BOOKTITLE = {Workshop on Asian Translation}. YEAR = {2021}}
This paper describes the work and the systems submitted by the IIIT-Hyderabad team (Id: IIITH) in the WAT 2021 (Nakazawa et al., 2021) MultiIndicMT shared task. The task covers 10 major languages of the Indian subcontinent. For the scope of this task, we built multilingual systems for 20 translation directions, namely English-Indic (one-to-many) and Indic-English (many-to-one). Individually, Indian languages are resource poor, which hampers translation quality, but by leveraging multilingualism and abundant monolingual corpora, translation quality can be substantially boosted. However, multilingual systems are highly complex in terms of time as well as computational resources. Therefore, we train our systems by efficiently selecting data that will actually contribute most to the learning process. Furthermore, we also exploit the language relatedness found among Indian languages. All comparisons were made using the BLEU score, and we found that our final multilingual system significantly outperforms the baselines by an average of 11.3 and 19.6 BLEU points for English-Indic (en-xx) and Indic-English (xx-en) directions, respectively.
How do different factors Impact the Inter-language Similarity? A Case Study on Indian languages
Sourav Kumar,Salil Aggarwal,Dipti Mishra Sharma,Radhika Mamidi
Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP SRW, 2021
@inproceedings{bib_How__2021, AUTHOR = {Sourav Kumar, Salil Aggarwal, Dipti Mishra Sharma, Radhika Mamidi}, TITLE = {How do different factors Impact the Inter-language Similarity? A Case Study on Indian languages}, BOOKTITLE = {Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing}. YEAR = {2021}}
India is one of the most linguistically diverse nations of the world and is culturally very rich. Most of these languages are somewhat similar to each other on account of sharing a common ancestry or being in contact for a long period of time (Bhattacharyya et al., 2016). Researchers are constantly putting effort into utilizing language relatedness to improve the performance of various NLP systems, such as cross-lingual semantic search, machine translation (Kunchukuttan and Bhattacharyya, 2020) and sentiment analysis systems. In this paper, we perform an extensive case study on similarity involving languages of the Indian subcontinent. Language similarity prediction is defined as the task of measuring how similar two languages are on the basis of their lexical, morphological and syntactic features. In this study, we concentrate only on calculating lexical similarity between Indian languages by looking at various factors such as the size and type of corpus, similarity algorithms, and subword segmentation. The main takeaways from our work are: (i) the relative order of language similarities largely remains the same, regardless of the factors mentioned above, (ii) similarity within the same language family is higher, (iii) languages share more lexical features at the subword level.
Multilingual Multi-Domain NMT for Indian Languages
Sourav Kumar Singh,Salil Aggarwal,Dipti Mishra Sharma
Recent Advances in Natural Language Processing, RANLP, 2021
@inproceedings{bib_Mult_2021, AUTHOR = {Sourav Kumar Singh, Salil Aggarwal, Dipti Mishra Sharma}, TITLE = {Multilingual Multi-Domain NMT for Indian Languages}, BOOKTITLE = {Recent Advances in Natural Language Processing}. YEAR = {2021}}
India is known as the land of many tongues and dialects. Neural machine translation (NMT) is the current state-of-the-art approach for machine translation (MT) but performs well only with large datasets, which Indian languages usually lack, making this approach infeasible. In this paper, we address the problem of data scarcity by efficiently training multilingual and multilingual multi-domain NMT systems involving languages of the Indian subcontinent. We propose a technique for using joint domain and language tags in a multilingual setup. We draw three major conclusions from our experiments: (i) training a multilingual system by exploiting lexical similarity based on language family helps achieve an overall average improvement of 3.25 BLEU points over bilingual baselines, (ii) incorporating domain information into the language tokens helps the multilingual multi-domain system achieve a significant average improvement of 6 BLEU points over the baselines, (iii) multistage fine-tuning further yields an improvement of 1-1.5 BLEU points for the language pair of interest.
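The joint domain-and-language-tag technique described above amounts to simple source-side preprocessing before training a single shared model. A minimal sketch, with an illustrative tag format (the exact tokens used in the paper are not reproduced here):

```python
def tag_source(sentence, tgt_lang, domain):
    """Prepend joint language and domain tokens to a source sentence so a
    single multilingual multi-domain NMT model can route it.
    The token format (<2xx> for target language, <dom> for domain) is an
    assumption for illustration, not the paper's exact scheme."""
    return f"<2{tgt_lang}> <{domain}> {sentence}"

# Hypothetical training pairs: (source sentence, target language, domain).
corpus = [
    ("unka ghar bada hai", "te", "news"),
    ("amlo ka vilayan", "te", "chem"),
]
tagged = [tag_source(src, lang, dom) for src, lang, dom in corpus]
```

At inference time the same tagging is applied to the input sentence, letting one model serve every language-domain combination.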
Multilingual Multi-Domain NMT for Indian languages
Sourav Kumar,Salil Aggarwal,Dipti Mishra Sharma
Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP SRW, 2021
@inproceedings{bib_Mult_2021, AUTHOR = {Sourav Kumar, Salil Aggarwal, Dipti Mishra Sharma}, TITLE = {Multilingual Multi-Domain NMT for Indian languages}, BOOKTITLE = {Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing}. YEAR = {2021}}
India is known as the land of many tongues. There is no single language called Indian. India speaks hundreds of languages and dialects. Some are extinct, while some are still in use with considerable speakers. Despite having a lot of different scripts, most of the Indian languages still share a lot of lexical features which can be utilized to help improve the quality of Multilingual NMT systems trained on them. So, in this paper, we present an extensive study of Multilingual as well as Multilingual Multi Domain NMT involving languages of the Indian subcontinent. We draw four major conclusions from our experiments: (i) Multilingual Multi Domain models can significantly improve the accuracy of all the individual languages within their domains, resulting in improving the overall performance of the Multilingual Multi Domain system, (ii) Encoder representation of different languages based on their family helps Multilingual models gain an average improvement of 3.25 BLEU points, (iii) Our new technique of incorporating domain information into the language tokens results in getting a significant improvement of 6 BLEU points on an average as compared to the baselines, (iv) Multistage Fine-tuning further helps in improvement of (1-1.5) BLEU points.
Fine-grained domain classification using Transformers
Akshat Gahoi,Akshat Chhajer,Dipti Mishra Sharma
International Conference on Natural Language Processing (ICON): TechDOfication Shared Task, ICON-W, 2020
@inproceedings{bib_Fine_2020, AUTHOR = {Akshat Gahoi, Akshat Chhajer, Dipti Mishra Sharma}, TITLE = {Fine-grained domain classification using Transformers}, BOOKTITLE = {International Conference on Natural Language Processing (ICON): TechDOfication Shared Task}. YEAR = {2020}}
The introduction of transformers in 2017, and of BERT in 2018, brought about a revolution in the field of natural language processing. Such models are pretrained on vast amounts of data and are easily extensible to a wide variety of tasks through transfer learning. Continual work on transformer-based architectures has led to a variety of new models with state-of-the-art results. RoBERTa (Liu et al., 2019) is one such model, which brings a series of changes to the BERT (Devlin et al., 2018) architecture and is capable of producing better quality embeddings at the expense of some functionality. In this paper, we address the well-known text classification task of fine-grained domain classification using BERT and RoBERTa and perform a comparative analysis of the two. We also evaluate the impact of data preprocessing, especially in the context of fine-grained domain classification. Our results outperformed all other models at the ICON TechDOfication 2020 (subtask-2a) fine-grained domain classification task and ranked first, demonstrating the effectiveness of our approach.
N-Grams TextRank : A Novel Domain Keyword Extraction Technique
Saransh Rajput,Akshat Gahoi,Manvith Muthukuru Reddy,Dipti Mishra Sharma
International Conference on Natural Language Processing., ICON, 2020
@inproceedings{bib_N-Gr_2020, AUTHOR = {Saransh Rajput, Akshat Gahoi, Manvith Muthukuru Reddy, Dipti Mishra Sharma}, TITLE = {N-Grams TextRank : A Novel Domain Keyword Extraction Technique}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2020}}
The rapid growth of the internet has given us a wealth of information and data spread across the web. However, as the data grows, we simultaneously face the grave problem of an information explosion. An abundance of data can lead to large-scale data management problems as well as the loss of the true meaning of the data. In this paper, we present an advanced domain-specific keyword extraction algorithm in order to tackle this problem of paramount importance. Our algorithm is based on a modified version of the TextRank (Mihalcea and Tarau, 2004) algorithm, itself based on PageRank (Page et al., 1998), to determine the keywords of a domain-specific document. Furthermore, this paper proposes a modification to the traditional TextRank algorithm that takes bigrams and trigrams into account and returns results with extremely high precision. We observe that the precision and F1-score of this model outperform other models in many domains, and that recall can easily be increased by returning more results without affecting precision.
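The bigram/trigram extension of TextRank described above can be sketched as follows; the candidate extraction, window size and damping factor are illustrative assumptions, not the paper's exact configuration:

```python
from collections import defaultdict

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_textrank(sentences, max_n=3, window=4, damping=0.85, iters=30):
    """Minimal sketch of a TextRank variant whose graph nodes are unigrams,
    bigrams and trigrams; edges link candidates that fall close together
    in the candidate sequence of the same sentence. Returns candidates
    sorted by PageRank-style score, best first."""
    graph = defaultdict(set)
    for sent in sentences:
        toks = sent.lower().split()
        cands = [c for n in range(1, max_n + 1) for c in ngrams(toks, n)]
        for i, a in enumerate(cands):
            for b in cands[i + 1:i + 1 + window]:
                if a != b:
                    graph[a].add(b)
                    graph[b].add(a)
    # Power iteration of the PageRank update over the undirected graph.
    score = {v: 1.0 for v in graph}
    for _ in range(iters):
        score = {
            v: (1 - damping) + damping * sum(
                score[u] / len(graph[u]) for u in graph[v])
            for v in graph
        }
    return sorted(score, key=score.get, reverse=True)
```

A production version would also filter candidates by part of speech and deduplicate overlapping n-grams, which this sketch omits.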
Enhanced Urdu Word Segmentation using Conditional Random Fields and Morphological Context Features
Aamir Farhan,Mashrukh Islam,Dipti Mishra Sharma
Widening Natural Language Processing Workshop, WiNLP, 2020
@inproceedings{bib_Enha_2020, AUTHOR = {Aamir Farhan, Mashrukh Islam, Dipti Mishra Sharma}, TITLE = {Enhanced Urdu Word Segmentation using Conditional Random Fields and Morphological Context Features}, BOOKTITLE = {Widening Natural Language Processing Workshop}. YEAR = {2020}}
Word segmentation is a fundamental task for most NLP applications. Urdu adopts the Nastalique writing style, which does not have a concept of space. Furthermore, the inherent non-joining attributes of certain characters in Urdu create spaces within a word when written in digital format. Thus, Urdu not only has space omission but also space insertion issues, which make the word segmentation task challenging. In this paper, we improve upon the results of Zia, Raza and Athar (2018) by using a manually annotated corpus of 19,651 sentences along with morphological context features. Using the Conditional Random Field sequence modeler, our model achieves an F1 score of 0.98 for word boundary identification and 0.92 for sub-word boundary identification. The results demonstrated in this paper outperform the state-of-the-art methods.
NMT based Similar Language Translation for Hindi - Marathi
Vandan Mujadia,Dipti Mishra Sharma
Conference on Machine Translation, WMT, 2020
@inproceedings{bib_NMT__2020, AUTHOR = {Vandan Mujadia, Dipti Mishra Sharma}, TITLE = {NMT based Similar Language Translation for Hindi - Marathi}, BOOKTITLE = {Conference on Machine Translation}. YEAR = {2020}}
This paper describes the participation of team F1toF6 (LTRC, IIIT-Hyderabad) for the WMT 2020 task, similar language translation. We experimented with attention based recurrent neural network architecture (seq2seq) for this task. We explored the use of different linguistic features like POS and Morph along with back translation for Hindi-Marathi and Marathi-Hindi machine translation.
Cross-Lingual Transfer for Hindi Discourse Relation Identification
Anirudh Dahiya,Manish Srivastava,Dipti Mishra Sharma
Speech and Dialogue Conference, TSD, 2020
@inproceedings{bib_Cros_2020, AUTHOR = {Anirudh Dahiya, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Cross-Lingual Transfer for Hindi Discourse Relation Identification}, BOOKTITLE = {Speech and Dialogue Conference}. YEAR = {2020}}
Discourse relations between two textual spans in a document attempt to capture the coherent structure which emerges in language use. Automatic classification of these relations remains a challenging task, especially in the case of implicit discourse relations, where there is no explicit textual cue marking the discourse relation. In low resource languages, this motivates the exploration of transfer learning approaches, particularly cross-lingual techniques, for discourse relation classification. In this work, we explore various cross-lingual transfer techniques on the Hindi Discourse Relation Bank (HDRB), a Penn Discourse Treebank styled dataset for discourse analysis in Hindi, and observe performance gains in both zero-shot and fine-tuning settings on the Hindi discourse relation classification task. To the best of our knowledge, this is the first effort towards exploring transfer learning for Hindi discourse relation classification.
MEE: An Automatic Metric for Evaluation Using Embeddings for Machine Translation
Ananya Mukherjee,Ala Hema,Manish Srivastava,Dipti Mishra Sharma
International Conference on Data Science and Advanced Analytics, DSAA, 2020
@inproceedings{bib_MEE:_2020, AUTHOR = {Ananya Mukherjee, Ala Hema, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {MEE: An Automatic Metric for Evaluation Using Embeddings for Machine Translation}, BOOKTITLE = {International Conference on Data Science and Advanced Analytics}. YEAR = {2020}}
We propose MEE, an approach for automatic Machine Translation (MT) evaluation which leverages the similarity between embeddings of words in candidate and reference sentences to assess translation quality. Unigrams are matched based on their surface forms, root forms and meanings which aids to capture lexical, morphological and semantic equivalence. We perform experiments for MT from English to four Indian Languages (Telugu, Marathi, Bengali and Hindi) on a robust dataset comprising simple and complex sentences with good and bad translations. Further, it is observed that the proposed metric correlates better with human judgements than the existing widely used metrics.
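The staged unigram matching that MEE describes (surface form, then root form, then meaning) can be sketched with toy lexicons standing in for a morphological analyser and embedding similarity; the lexicons and the F1 combination below are assumptions for illustration, not the metric's exact formulation:

```python
def mee_like_score(candidate, reference, roots=None, synonyms=None):
    """Toy sketch of MEE-style unigram matching: match candidate words to
    reference words on surface form first, then shared root form, then
    meaning (a synonym table stands in for embedding similarity here).
    Returns an F1 score over matched unigrams."""
    roots = roots or {}        # word -> root form (stand-in for a morph analyser)
    synonyms = synonyms or {}  # word -> tuple of near-synonyms
    cand, ref = candidate.split(), reference.split()
    ref_left = list(ref)       # each reference word may be matched once
    matched = 0
    for w in cand:
        for r in ref_left:
            same_root = roots.get(w) == roots.get(r, object())
            if w == r or same_root or r in synonyms.get(w, ()):
                matched += 1
                ref_left.remove(r)
                break
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)
```

The `object()` default guarantees that two words absent from the root lexicon never spuriously match on `None == None`.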
Linguistically Informed Hindi-English Neural Machine Translation
Vikrant Goyal,Pruthwik Mishra,Dipti Mishra Sharma
International Conference on Language Resources and Evaluation, LREC, 2020
@inproceedings{bib_Ling_2020, AUTHOR = {Vikrant Goyal, Pruthwik Mishra, Dipti Mishra Sharma}, TITLE = {Linguistically Informed Hindi-English Neural Machine Translation}, BOOKTITLE = {International Conference on Language Resources and Evaluation}. YEAR = {2020}}
Hindi-English Machine Translation is a challenging problem, owing to multiple factors including the morphological complexity and relatively free word order of Hindi, in addition to the lack of sufficient parallel training data. Neural Machine Translation (NMT) is a rapidly advancing MT paradigm and has shown promising results for many language pairs, especially in large training data scenarios. To overcome the data sparsity issue caused by the lack of large parallel corpora for Hindi-English, we propose a method to employ additional linguistic knowledge encoding different phenomena depicted by Hindi. We generalize the embedding layer of the state-of-the-art Transformer model to incorporate linguistic features like POS tag, lemma and morph features to improve translation performance. We compare the results obtained on incorporating this knowledge with the baseline systems and demonstrate significant performance improvements. Although Transformer NMT models have a strong capacity to learn language constructs, we show that using these specific features further helps improve translation performance.
Checkpoint Reranking: An Approach To Select Better Hypothesis For Neural Machine Translation Systems
Pandramish Vinay,Dipti Mishra Sharma
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2020
@inproceedings{bib_Chec_2020, AUTHOR = {Pandramish Vinay, Dipti Mishra Sharma}, TITLE = {Checkpoint Reranking: An Approach To Select Better Hypothesis For Neural Machine Translation Systems}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2020}}
In this paper, we propose a method for reranking the outputs of Neural Machine Translation (NMT) systems. After training an NMT baseline system, we select the outputs of a few of the last training iterations as the N-best list. It has been observed that these iteration outputs have an oracle score up to 1.01 BLEU points higher than the last iteration of the trained system. We devise a ranking mechanism that relies solely on the decoder's ability to generate distinct tokens, without using any language model or additional data. With this method, we achieve a translation improvement of up to +0.16 BLEU points over the baseline. We also evaluate our approach when a coverage penalty is applied during training. With a moderate coverage penalty, the oracle scores are up to +0.99 BLEU points higher than the final iteration, and our algorithm gives an improvement of up to +0.17 BLEU points. With an excessive penalty, translation quality decreases relative to the baseline system; still, an increase in oracle scores of up to +1.30 BLEU points is observed, with the re-ranking algorithm giving an improvement of up to +0.15 BLEU points. The proposed re-ranking method is generic and can be extended to other language pairs.
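The language-model-free ranking criterion described above, scoring hypotheses purely by the decoder's ability to produce distinct tokens, might be approximated with a distinct-token ratio; this proxy scoring function is an assumption, not the paper's exact formula:

```python
def rerank_by_distinct_tokens(hypotheses):
    """Sketch of checkpoint reranking without a language model: among
    N-best hypotheses taken from different training iterations, prefer
    the one whose output uses the largest fraction of distinct tokens
    (a proxy for the decoder avoiding repetition). Ties keep the
    earliest hypothesis in the list."""
    def distinct_ratio(hyp):
        toks = hyp.split()
        return len(set(toks)) / len(toks) if toks else 0.0
    return max(hypotheses, key=distinct_ratio)
```

A repetitive hypothesis such as "the the cat" scores 2/3, so a non-repetitive alternative of the same length wins.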
Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages
Vikrant Goyal,Sourav Kumar,Dipti Mishra Sharma
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2020
@inproceedings{bib_Effi_2020, AUTHOR = {Vikrant Goyal, Sourav Kumar, Dipti Mishra Sharma}, TITLE = {Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2020}}
A large percentage of the world’s population speaks a language of the Indian subcontinent, comprising languages from both Indo-Aryan (e.g. Hindi, Punjabi, Gujarati, etc.) and Dravidian (e.g. Tamil, Telugu, Malayalam, etc.) families. A universal characteristic of Indian languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high-quality parallel data, can make developing machine translation (MT) systems for these languages difficult. Neural Machine Translation (NMT) is a rapidly advancing MT paradigm and has shown promising results for many language pairs, especially in large training data scenarios. Since the condition of large parallel corpora is not met for Indian-English language pairs, we present our efforts towards building efficient NMT systems between Indian languages (specifically Indo-Aryan languages) and English via efficiently exploiting parallel data from the related languages. We propose a technique called Unified Transliteration and Subword Segmentation to leverage language similarity while exploiting parallel data from related language pairs. We also propose a Multilingual Transfer Learning technique to leverage parallel data from multiple related languages to assist translation for the low-resource language pair of interest. Our experiments demonstrate an overall average improvement of 5 BLEU points over the standard Transformer-based NMT baselines.
A Fully Expanded Dependency Treebank for Telugu
Sneha Nallani,Manish Srivastava,Dipti Mishra Sharma
International Conference on Language Resources and Evaluation, LREC, 2020
@inproceedings{bib_A_Fu_2020, AUTHOR = {Sneha Nallani, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {A Fully Expanded Dependency Treebank for Telugu}, BOOKTITLE = {International Conference on Language Resources and Evaluation}. YEAR = {2020}}
Treebanks are an essential resource for syntactic parsing. The available Paninian dependency treebank(s) for Telugu are annotated only with inter-chunk dependency relations, and not all words of a sentence are part of the parse tree. In this paper, we automatically annotate the intra-chunk dependencies in the treebank using a Shift-Reduce parser based on Context Free Grammar rules for Telugu chunks. We also propose a few additional intra-chunk dependency relations for Telugu apart from the ones used in the Hindi treebank. Annotating intra-chunk dependencies finally provides a complete parse tree for every sentence in the treebank. Having a fully expanded treebank is crucial for developing end-to-end parsers which produce complete trees. We present a fully expanded dependency treebank for Telugu consisting of 3220 sentences. In this paper, we also convert the treebank annotated with the Anncorra part-of-speech tagset to the latest BIS tagset. The BIS tagset is a hierarchical tagset adopted as a unified part-of-speech standard across all Indian languages. The final treebank is made publicly available.
A Simple and Effective Dependency Parser for Telugu
Sneha Nallani,Manish Srivastava,Dipti Mishra Sharma
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2020
@inproceedings{bib_A_Si_2020, AUTHOR = {Sneha Nallani, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {A Simple and Effective Dependency Parser for Telugu}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2020}}
We present a simple and effective dependency parser for Telugu, a morphologically rich, free word order language. We propose to replace the rich linguistic feature templates used in the past approaches with a minimal feature function using contextual vector representations. We train a BERT model on the Telugu Wikipedia data and use vector representations from this model to train the parser. Each sentence token is associated with a vector representing the token in the context of that sentence and the feature vectors are constructed by concatenating two token representations from the stack and one from the buffer. We put the feature representations through a feed forward network and train with a greedy transition based approach. The resulting parser has a very simple architecture with minimal feature engineering and achieves state-of-the-art results for Telugu.
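The minimal feature function described above (the top two stack tokens plus the first buffer token, concatenated) can be sketched as follows; the vector dimension and the top-first stack convention are illustrative, and the contextual vectors stand in for Telugu BERT output:

```python
import numpy as np

def parser_features(vectors, stack, buffer, dim=8):
    """Sketch of a transition-based parser's feature function: represent
    the parser state by concatenating the contextual vectors of the top
    two stack tokens and the first buffer token. `vectors` maps token
    index -> contextual vector (a stand-in for BERT representations);
    `stack` is given top-first; missing positions get a zero pad."""
    pad = np.zeros(dim)
    def vec(positions, i):
        return vectors[positions[i]] if i < len(positions) else pad
    return np.concatenate([vec(stack, 0), vec(stack, 1), vec(buffer, 0)])
```

The resulting fixed-size vector would then be fed to a feed-forward network that predicts the next shift/reduce transition.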
LTRC-MT Simple & Effective Hindi-English Neural Machine Translation Systems at WAT 2019
Vikrant Goyal,Dipti Mishra Sharma
Workshop on Asian Translation, WAT, 2019
@inproceedings{bib_LTRC_2019, AUTHOR = {Vikrant Goyal, Dipti Mishra Sharma}, TITLE = {LTRC-MT simple & effective Hindi-English neural machine translation systems at WAT 2019}, BOOKTITLE = {Workshop on Asian Translation}. YEAR = {2019}}
This paper describes the Neural Machine Translation systems of IIIT-Hyderabad (LTRC-MT) for WAT 2019 Hindi-English shared task. We experimented with both Recurrent Neural Networks & Transformer architectures. We also show the results of our experiments of training NMT models using additional data via backtranslation.
A Dataset for Semantic Role Labelling of Hindi-English Code-Mixed Tweets
Riya Pal,Dipti Mishra Sharma
Linguistic Annotation Workshop, LAW, 2019
@inproceedings{bib_A_Da_2019, AUTHOR = {Riya Pal, Dipti Mishra Sharma}, TITLE = {A Dataset for Semantic Role Labelling of Hindi-English Code-Mixed Tweets}, BOOKTITLE = {Linguistic Annotation Workshop}. YEAR = {2019}}
We present a data set of 1460 Hindi-English code-mixed tweets consisting of 20,949 tokens labelled with Proposition Bank labels marking their semantic roles. We created verb frames for complex predicates present in the corpus and formulated mappings from Paninian dependency labels to Proposition Bank labels. With the help of these mappings and the dependency tree, we propose a baseline rule based system for Semantic Role Labelling of Hindi-English code-mixed data. We obtain an accuracy of 96.74% for Argument Identification and are able to further classify 73.93% of the labels correctly. While there is relevant ongoing research on Semantic Role Labelling (SRL) and on building tools for code-mixed social media data, this is the first attempt at labelling semantic roles in Hindi-English code-mixed data, to the best of our knowledge.
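A fragment of the kind of Paninian-to-PropBank mapping the paper formulates might look like this; the label pairs shown follow a common convention in the Hindi PropBank literature and are not the paper's full mapping table:

```python
# Illustrative Paninian karaka -> PropBank role mapping; the exact
# inventory used in the paper is an assumption here.
KARAKA_TO_PROPBANK = {
    "k1": "ARG0",       # karta (agent-like)
    "k2": "ARG1",       # karma (patient/theme)
    "k4": "ARG2",       # sampradaana (recipient)
    "k7t": "ARGM-TMP",  # time
    "k7p": "ARGM-LOC",  # place
}

def label_arguments(dependents):
    """Rule-based SRL baseline step: map each dependent token's Paninian
    relation to a PropBank role where a mapping exists (None otherwise)."""
    return {tok: KARAKA_TO_PROPBANK.get(rel) for tok, rel in dependents}
```

Given a dependency tree, walking each predicate's dependents through such a table yields the baseline role labels that the statistical system is later compared against.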
Towards Automated Semantic Role Labelling of Hindi-English Code-Mixed Tweets
Riya Pal,Dipti Mishra Sharma
Workshop on Noisy User-generated Text, W-NUT, 2019
@inproceedings{bib_Towa_2019, AUTHOR = {Riya Pal, Dipti Mishra Sharma}, TITLE = {Towards Automated Semantic Role Labelling of Hindi-English Code-Mixed Tweets}, BOOKTITLE = {Workshop on Noisy User-generated Text}. YEAR = {2019}}
We present a system for automating Semantic Role Labelling of Hindi-English code-mixed tweets. We explore the issues posed by noisy, user generated code-mixed social media data. We also compare the individual effect of various linguistic features used in our system. Our proposed model is a 2-step system for automated labelling which gives an overall accuracy of 84% for Argument Classification, marking a 10% increase over the existing rule-based baseline model. To the best of our knowledge, this is the first attempt at building a statistical Semantic Role Labeller for Hindi-English code-mixed data.
Text Cohesion in CQA-Does it Impact Rating?
Lalit Mohan S,Jahfar Ali P,Syed Mohd Ali Rizwi,Raghu Babu Reddy Y,Dipti Mishra Sharma
International Conference on Mining Intelligence and Knowledge Exploration, MIKE, 2019
@inproceedings{bib_Text_2019, AUTHOR = {Lalit Mohan S, Jahfar Ali P, Syed Mohd Ali Rizwi, Raghu Babu Reddy Y, Dipti Mishra Sharma}, TITLE = {Text Cohesion in CQA-Does it Impact Rating?}, BOOKTITLE = {International Conference on Mining Intelligence and Knowledge Exploration}. YEAR = {2019}}
Community Question and Answer (CQA) platforms are expected to provide relevant content that is not readily available through search engines. With an increase in the number of users and the growth of the internet, CQA platforms have transitioned from generic to domain-specific systems. Expert rating, machine learning and statistical methods are being used for assessing the quality of answers. However, research on the importance of consistency as a quality parameter, in the form of text cohesion in CQAs, is limited. We extracted 109,113 CQAs related to Information Security from the last 8 years of StackExchange to evaluate text cohesion in answers. An empirical study conducted with 246 participants (Information Security experts, Software Engineers and Computational Linguists) on the extracted answers found that lack of text cohesion impacts the rating of answers in CQA. Software Engineers, as seekers and viewers of answers, responded in a survey that lack of text cohesion leads to difficulty in reading and remembering, while Information Security experts providing answers to CQA stated that they need text cohesion for understandability.
The IIIT-H Gujarati-English Machine Translation system for WMT19
Vikrant Goyal,Dipti Mishra Sharma
Conference of the Association of Computational Linguistics, ACL, 2019
@inproceedings{bib_The__2019, AUTHOR = {Vikrant Goyal, Dipti Mishra Sharma}, TITLE = {The IIIT-H Gujarati-English Machine Translation system for WMT19}, BOOKTITLE = {Conference of the Association of Computational Linguistics}. YEAR = {2019}}
This paper describes the Neural Machine Translation system of IIIT-Hyderabad for the Gujarati→English news translation shared task of WMT19. Our system is based on an encoder-decoder framework with an attention mechanism. We experimented with multilingual Neural MT models. Our experiments show that Multilingual Neural Machine Translation leveraging parallel data from related language pairs helps achieve significant BLEU improvements of up to 11.5 for low-resource language pairs like Gujarati-English.
Classification of Insincere Questions with ML and Neural Approaches
Vandan Mujadia,Pruthwik Mishra,Dipti Mishra Sharma
Forum for Information Retrieval Evaluation, FIRE, 2019
@inproceedings{bib_Clas_2019, AUTHOR = {Vandan Mujadia, Pruthwik Mishra, Dipti Mishra Sharma}, TITLE = {Classification of Insincere Questions with ML and Neural Approaches}, BOOKTITLE = {Forum for Information Retrieval Evaluation}. YEAR = {2019}}
CIQ, or the Classification of Insincere Questions task at FIRE 2019, focuses on differentiating proper information-seeking questions from different kinds of insincere questions. As a part of this task, we (team A3-108) submitted different machine learning and neural network based models. Our best performing model was an ensemble of gradient boosting, random forest and 3-nearest-neighbour classifiers with majority voting. This model correctly classified 62.37% of the questions, and we secured third position in the task.
IIIT-Hyderabad at HASOC 2019: Hate Speech Detection
Dipti Mishra Sharma,Vandan Mujadia,PRUTHWIK MISHRA
Forum for Information Retrieval Evaluation, FIRE, 2019
@inproceedings{bib_IIIT_2019, AUTHOR = {Dipti Mishra Sharma, Vandan Mujadia, PRUTHWIK MISHRA}, TITLE = {IIIT-Hyderabad at HASOC 2019: Hate Speech Detection}, BOOKTITLE = {Forum for Information Retrieval Evaluation}. YEAR = {2019}}
Automatic identification of offensive language on various social media platforms, especially Twitter, poses a great challenge to the AI community. The repercussions of such writings are hazardous to individuals, communities, organizations and nations. The HASOC shared task targets automatic detection of abusive language on Twitter in English, German and Hindi. As part of this task, we (team A3-108) submitted different machine learning and neural network based models for all the languages. Our best performing model was an ensemble of SVM, Random Forest and AdaBoost classifiers with majority voting.
LTRC-MT Simple & Effective Hindi-English Neural Machine Translation Systems at WAT 2019
Vikrant Goyal,Dipti Mishra Sharma
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2019
@inproceedings{bib_LTRC_2019, AUTHOR = {Vikrant Goyal, Dipti Mishra Sharma}, TITLE = {LTRC-MT Simple & Effective Hindi-English Neural Machine Translation Systems at WAT 2019}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2019}}
This paper describes the Neural Machine Translation systems of IIIT-Hyderabad (LTRC-MT) for WAT 2019 Hindi-English shared task. We experimented with both Recurrent Neural Networks & Transformer architectures. We also show the results of our experiments of training NMT models using additional data via backtranslation.
Curriculum Learning Strategies for Hindi-English Codemixed Sentiment Analysis
ANIRUDH DAHIYA,NEERAJ BATTAN,Manish Srivastava,Dipti Mishra Sharma
International Joint Conference on Artificial Intelligence, IJCAI, 2019
@inproceedings{bib_Curr_2019, AUTHOR = {ANIRUDH DAHIYA, NEERAJ BATTAN, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Curriculum Learning Strategies for Hindi-English Codemixed Sentiment Analysis}, BOOKTITLE = {International Joint Conference on Artificial Intelligence}. YEAR = {2019}}
Sentiment analysis and other semantic tasks are commonly used in social media textual analysis to gauge public opinion and make sense of the noise on social media. The language used on social media not only diverges from formal language, but is further compounded by code-mixing between languages, especially in large multilingual societies like India. Traditional methods for learning semantic NLP tasks have long relied on end-to-end task-specific training, requiring an expensive data creation process, even more so for deep learning methods. This challenge is more severe still for resource-scarce texts like code-mixed language pairs, which lack well-learnt representations to serve as model priors and whose task-specific datasets are too few and too small to efficiently exploit recent deep learning approaches. To address these challenges, we introduce curriculum learning strategies for semantic tasks in code-mixed Hindi-English (Hi-En) texts, and investigate various training strategies for enhancing model performance. Our method outperforms the state-of-the-art methods for Hi-En code-mixed sentiment analysis by 3.31% accuracy, and also shows better model robustness in terms of convergence and variance in test performance.
IIT(BHU)–IIITH at CoNLL–SIGMORPHON 2018 Shared Task on Universal Morphological Reinflection
Abhishek Sharma,KATRAPATI GANESH SASANK,Dipti Mishra Sharma
Proceedings of the CoNLL-SIGMORPHON, CoNLL–SIGMORPHON, 2018
@inproceedings{bib_IIT(_2018, AUTHOR = {Abhishek Sharma, KATRAPATI GANESH SASANK, Dipti Mishra Sharma}, TITLE = {IIT(BHU)–IIITH at CoNLL–SIGMORPHON 2018 Shared Task on Universal Morphological Reinflection}, BOOKTITLE = {Proceedings of the CoNLL-SIGMORPHON}. YEAR = {2018}}
This paper describes the systems submitted by IIT (BHU), Varanasi/IIIT Hyderabad (IITBHU–IIITH) for Task 1 of the CoNLL–SIGMORPHON 2018 Shared Task on Universal Morphological Reinflection (Cotterell et al., 2018). The task is to generate the inflected form given a lemma and a set of morphological features. The systems are evaluated on over 100 distinct languages and three different resource settings (low, medium and high). We formulate the task as a sequence-to-sequence learning problem. As most of the characters in the inflected form are copied from the lemma, we use a Pointer-Generator Network (See et al., 2017), which makes it easier for the system to copy characters from the lemma. The Pointer-Generator Network also helps in dealing with out-of-vocabulary characters during inference. Our best performing system stood 4th among 28 systems, 3rd among 23 systems and 4th among 23 systems for the low, medium and high resource settings, respectively.
Decision tree ensemble for parts-of-speech tagging of resource-poor languages
G VAMSI KRISHNA,PRATIBHA RANI,Vikram Pudi,Dipti Mishra Sharma
Forum for Information Retrieval Evaluation, FIRE, 2018
@inproceedings{bib_Deci_2018, AUTHOR = {G VAMSI KRISHNA, PRATIBHA RANI, Vikram Pudi, Dipti Mishra Sharma}, TITLE = {Decision tree ensemble for parts-of-speech tagging of resource-poor languages}, BOOKTITLE = {Forum for Information Retrieval Evaluation}. YEAR = {2018}}
Ensemble POS taggers are a good choice to integrate and leverage the benefits of various types of POS taggers. This can help the large number (6500+) of resource-poor languages which do not have much annotated training data, by providing ways to integrate semi-supervised/unsupervised taggers with supervised taggers. In this paper we present our experiments in developing ensemble POS taggers using a decision tree. We integrate a semi-supervised data mining approach that uses context based lists (CBLs) for POS tagging with two supervised taggers: (1) a Support Vector Machine based POS tagger, called SVMTool, and (2) a Conditional Random Field based POS tagger. The results are enhanced semi-supervised ensemble POS taggers which outperform the base methods. In these POS taggers, we use a decision tree to decide when to rely on the output of the supervised tagger, and when to rely on the semi-supervised CBL method. The CBL based tagger uses rich contextual information, which helps in tagging both existing and unseen words, and uses no domain knowledge, while supervised taggers perform well on words present in the training model and can include domain based features. Hence, these algorithms have complementary strengths, and our ensemble is able to combine them. The enhanced performance of our new POS taggers over the base methods suggests that the ensemble combines the complementary qualities of its components. These semi-supervised ensemble taggers are therefore more suitable for resource-poor languages.
Building a Kannada POS Tagger Using Machine Learning and Neural Network Models
Ketan Kumar Todi,PRUTHWIK MISHRA,Dipti Mishra Sharma
Technical Report, arXiv, 2018
@inproceedings{bib_Buil_2018, AUTHOR = {Ketan Kumar Todi, PRUTHWIK MISHRA, Dipti Mishra Sharma}, TITLE = {Building a Kannada POS Tagger Using Machine Learning and Neural Network Models}, BOOKTITLE = {Technical Report}. YEAR = {2018}}
POS tagging serves as a preliminary task for many NLP applications. Kannada is a relatively resource-poor Indian language with a very limited number of quality NLP tools available for use. An accurate and reliable POS tagger is essential for many NLP tasks like shallow parsing, dependency parsing, sentiment analysis and named entity recognition. We present a statistical POS tagger for Kannada using different machine learning and neural network models. Our Kannada POS tagger outperforms the state-of-the-art Kannada POS tagger by 6%. Our contribution in this paper is threefold: building a generic POS tagger, comparing the performances of different modeling techniques, and exploring the use of character and word embeddings together for Kannada POS tagging.
Arithmetic Word Problem Solver using Frame Identification
PRUTHWIK MISHRA,LITTON J KURISINKEL,Dipti Mishra Sharma
Technical Report, arXiv, 2018
@inproceedings{bib_Arit_2018, AUTHOR = {PRUTHWIK MISHRA, LITTON J KURISINKEL, Dipti Mishra Sharma}, TITLE = {Arithmetic Word Problem Solver using Frame Identification}, BOOKTITLE = {Technical Report}. YEAR = {2018}}
Automatic word problem solving has always posed a great challenge for the NLP community. Usually a word problem is a narrative comprising a few sentences, and a question is asked about a quantity referred to in the sentences. Solving a word problem involves reasoning across sentences, identification of operations, their order, and relevant quantities, and discarding irrelevant quantities. In this paper, we present a novel approach for automatic arithmetic word problem solving. Our approach starts with frame identification. Each frame is classified as either a state or an action frame. The frame identification depends on the verb in a sentence. Every frame is unique and is identified by its slots. The slots are filled using the dependency-parsed output of a sentence. The slots are entity holder, entity, quantity of the entity, recipient, and additional information like place and time. The slots and frames help to identify the type of question asked and the entity referred to. Action frames act on state frame(s), causing a change in the quantities of the state frames. The frames are then used to build a graph where any change in quantities can be propagated to the neighboring nodes. Most current solvers can only answer questions about a quantity, while our system can answer different kinds of questions like 'who' and 'what' in addition to the quantity-related 'how many'. There are three major contributions of this paper: 1. a Frame Annotated Corpus (with a frame annotation tool), 2. a Frame Identification Module, and 3. a new, easily understandable framework for word problem solving.
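The state/action frame idea above can be illustrated with a toy sketch: state frames record how many of an entity each holder has, and an action frame transfers a quantity from giver to recipient. The frame fields and the worked example are illustrative assumptions, not the paper's annotated slot inventory or propagation graph.

```python
# Toy state/action frames: an action frame's quantity change is
# propagated to the matching state frames (giver loses, recipient gains).
from dataclasses import dataclass

@dataclass
class StateFrame:
    holder: str
    entity: str
    quantity: int

@dataclass
class ActionFrame:
    giver: str
    recipient: str
    entity: str
    quantity: int

def apply_action(states, action):
    """Propagate an action frame's quantity change to the state frames."""
    for s in states:
        if s.entity != action.entity:
            continue  # the action only affects frames for the same entity
        if s.holder == action.giver:
            s.quantity -= action.quantity
        elif s.holder == action.recipient:
            s.quantity += action.quantity
    return states

# "Tom has 5 apples. Mary has 2 apples. Tom gives Mary 3 apples."
states = [StateFrame("Tom", "apple", 5), StateFrame("Mary", "apple", 2)]
apply_action(states, ActionFrame("Tom", "Mary", "apple", 3))

# "How many apples does Mary have?" -> look up Mary's state frame
answer = next(s.quantity for s in states if s.holder == "Mary")
print(answer)  # 5
```

Because the answer is read off a holder's state frame rather than computed from a single equation, the same structure can also answer 'who'/'what' style questions by matching on the other slots.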
Automated Error Correction and Validation for POS Tagging of Hindi
Sachi Angle,PRUTHWIK MISHRA,Dipti Mishra Sharma
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2018
@inproceedings{bib_Auto_2018, AUTHOR = {Sachi Angle, PRUTHWIK MISHRA, Dipti Mishra Sharma}, TITLE = {Automated Error Correction and Validation for POS Tagging of Hindi}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}. YEAR = {2018}}
The part-of-speech tag of a word can provide crucial information for a large number of tasks, so it is of utmost importance that POS tagged data is accurate. However, manually checking the data is a tedious and time-consuming task. Thus, there is a need for an automatic error correction and validation model for POS tagged data. In this paper, we work towards achieving this goal for Hindi POS tagging, using an ensemble consisting of three POS tagging models. Based on the predictions made by the three models and the POS tag present in the dataset, the ensemble model predicts the presence of an error. The POS tagging models explored were the Hidden Markov Model, Support Vector Machine, Conditional Random Fields, Long Short Term Memory (LSTM) networks, bidirectional LSTM networks, and Logistic Regression. A fully connected neural network was used to build the ensemble model, and it achieved an accuracy of 94.02%.
Universal Dependency Parsing for Hindi-English Code-Switching
IRSHAD AHMAD BHAT,Riyaz Ahmad Bhat,Manish Srivastava,Dipti Mishra Sharma
Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 2018
@inproceedings{bib_Univ_2018, AUTHOR = {IRSHAD AHMAD BHAT, Riyaz Ahmad Bhat, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Universal Dependency Parsing for Hindi-English Code-Switching}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics}. YEAR = {2018}}
Code-switching is a phenomenon of mixing grammatical structures of two or more languages under varied social constraints. Code-switching data differ so radically from the benchmark corpora used in the NLP community that the application of standard technologies to these data degrades their performance sharply. Unlike standard corpora, these data often need to go through additional processes such as language identification, normalization and/or back-transliteration for their efficient processing. In this paper, we investigate these indispensable processes and other problems associated with syntactic parsing of code-switching data and propose methods to mitigate their effects. In particular, we study dependency parsing of code-switching data of Hindi and English multilingual speakers from Twitter. We present a treebank of Hindi-English code-switching tweets under the Universal Dependencies scheme and propose a neural stacking model for parsing that efficiently leverages part-of-speech tag and syntactic tree annotations in the code-switching treebank and the preexisting Hindi and English treebanks. We also present normalization and back-transliteration models with a decoding process tailored for code-switching data. Results show that our neural stacking parser is 1.5% LAS points better than the augmented parsing model and our decoding process improves results by 3.8% LAS points over the first-best normalization and/or back-transliteration.
No more beating about the bush: A Step towards Idiom Handling for Indian Language NLP
Ruchit Agrawal,Vighnesh Chenthil Kumar,Vigneshwaran Muralidharan,Dipti Mishra Sharma
International Conference on Language Resources and Evaluation, LREC, 2018
@inproceedings{bib_No_m_2018, AUTHOR = {Ruchit Agrawal, Vighnesh Chenthil Kumar, Vigneshwaran Muralidharan, Dipti Mishra Sharma}, TITLE = {No more beating about the bush: A Step towards Idiom Handling for Indian Language NLP}, BOOKTITLE = {International Conference on Language Resources and Evaluation}. YEAR = {2018}}
One of the major challenges in the field of Natural Language Processing (NLP) is the handling of idioms; seemingly ordinary phrases which could be further conjugated or even spread across the sentence to fit the context. Since idioms are a part of natural language, the ability to tackle them brings us closer to creating efficient NLP tools. This paper presents a multilingual parallel idiom dataset for seven Indian languages in addition to English and demonstrates its usefulness for two NLP applications - Machine Translation and Sentiment Analysis. We observe significant improvement for both the subtasks over baseline models trained without employing the idiom dataset.
EquGener: A Reasoning Network for Word Problem Solving by Generating Arithmetic Equations
PRUTHWIK MISHRA,LITTON J KURISINKEL,Dipti Mishra Sharma,Vasudeva Varma Kalidindi
Pacific Asia Conference on Language, Information and Computation, PACLIC, 2018
@inproceedings{bib_EquG_2018, AUTHOR = {PRUTHWIK MISHRA, LITTON J KURISINKEL, Dipti Mishra Sharma, Vasudeva Varma Kalidindi}, TITLE = {EquGener: A Reasoning Network for Word Problem Solving by Generating Arithmetic Equations}, BOOKTITLE = {Pacific Asia Conference on Language, Information and Computation}. YEAR = {2018}}
Word problem solving has always been a challenging task as it involves reasoning across sentences, identification of operations and their order of application on relevant operands. Most of the earlier systems attempted to solve word problems with tailored features for handling each category of problems. In this paper, we present a new approach to solve simple arithmetic problems. Through this work we introduce a novel method where we first learn a dense representation of the problem description conditioned on the question in hand. We leverage this representation to generate the operands and operators in the appropriate order. Our approach improves upon the state-of-the-art system by 3% in one benchmark dataset while ensuring comparable accuracies in other datasets.
Extractive text summarisation in hindi
SAKSHEE VIJAY, Vartika Rai,Sorabh Gupta, Anshuman Vijayvargi,Dipti Mishra Sharma
International Conference on Asian Language Processing, IALP, 2017
@inproceedings{bib_Extr_2017, AUTHOR = {SAKSHEE VIJAY, Vartika Rai, Sorabh Gupta, Anshuman Vijayvargi, Dipti Mishra Sharma}, TITLE = {Extractive text summarisation in hindi}, BOOKTITLE = {International Conference on Asian Language Processing}. YEAR = {2017}}
With an immense amount of Hindi data growing on the web, a text summariser would help summarise government data, medical reports, news and research articles. Hindi is the fourth most-spoken first language in the world; written in the Devanagari script, it is the official language of the Government of India. There is no public dataset for extractive summarisation available in Hindi, so we extracted a dataset of 24,253 news articles; the extractive summaries were evaluated on various parameters against manual gold summaries of exactly 60 words each.
Leveraging Newswire Treebanks for Parsing Conversational Data with Argument Scrambling
Riyaz A. Bhat,IRSHAD AHMAD BHAT,Dipti Mishra Sharma
International Workshop on Parsing Technologies, IWPT, 2017
@inproceedings{bib_Leve_2017, AUTHOR = {Riyaz A. Bhat, IRSHAD AHMAD BHAT, Dipti Mishra Sharma}, TITLE = {Leveraging Newswire Treebanks for Parsing Conversational Data with Argument Scrambling}, BOOKTITLE = {International Workshop on Parsing Technologies}. YEAR = {2017}}
We investigate the problem of parsing conversational data of morphologically-rich languages such as Hindi where argument scrambling occurs frequently. We evaluate a state-of-the-art non-linear transition-based parsing system on a new dataset containing 506 dependency trees for sentences from Bollywood (Hindi) movie scripts and Twitter posts of Hindi monolingual speakers. We show that a dependency parser trained on a newswire treebank is strongly biased towards the canonical structures and degrades when applied to conversational data. Inspired by Transformational Generative Grammar (Chomsky, 1965), we mitigate the sampling bias by generating all theoretically possible alternative word orders of a clause from the existing (kernel) structures in the treebank. Training our parser on canonical and transformed structures improves performance on conversational data by around 9% LAS over the baseline newswire parser.
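The augmentation step above (generating all theoretically possible word orders of a clause from an existing treebank structure) can be sketched with a toy permutation over argument chunks. The chunk segmentation, the Hindi example, and the verb-final constraint are illustrative assumptions; the paper's actual transformation operates on dependency trees, not flat chunk lists.

```python
# Toy word-order augmentation: permute the argument chunks of a clause
# while keeping each chunk intact and the verb in clause-final position,
# as is typical of canonical Hindi order.
from itertools import permutations

def scramble_arguments(chunks, verb):
    """Yield every ordering of the argument chunks, verb kept final."""
    for order in permutations(chunks):
        yield list(order) + [verb]

# "raam ne" (ERG) | "siitaa ko" (DAT) | "kitaab" (book) | "dii" (gave)
chunks = [["raam", "ne"], ["siitaa", "ko"], ["kitaab"]]
variants = [" ".join(w for chunk in v for w in chunk)
            for v in scramble_arguments(chunks, ["dii"])]
for v in variants:
    print(v)
# 3 argument chunks -> 3! = 6 alternative clause orders
```

Training on all such scrambled variants, rather than only the canonical newswire order, is what lets the parser generalize to the argument scrambling found in conversational data.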
Semisupervied Data Driven Word Sense Disambiguation for Resource-poor Languages
PRATIBHA RANI,Vikram Pudi,Dipti Mishra Sharma
International Conference on Natural Language Processing., ICON, 2017
@inproceedings{bib_Semi_2017, AUTHOR = {PRATIBHA RANI, Vikram Pudi, Dipti Mishra Sharma}, TITLE = {Semisupervied Data Driven Word Sense Disambiguation for Resource-poor Languages}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2017}}
In this paper, we present a generic semi-supervised Word Sense Disambiguation (WSD) method. Existing WSD methods extensively use domain resources and linguistic knowledge. Our proposed method extracts context based lists from small sense-tagged and untagged training data without using domain knowledge. Experiments on Hindi and Marathi Tourism and Health domains show that it gives good performance without using any language-specific linguistic information except the sense IDs present in the sense-tagged training set, and that it works well even with small training data by handling the data sparsity issue. Other advantages are that domain expertise is not needed for crafting and selecting features to build the WSD model, and that it can handle the non-availability of matching contexts in the sense-tagged training set. It also finds sense IDs of those test words which are not present in the sense-tagged training data.
Linguistic approach based Transfer Learning for Sentiment Classification in Hindi
VARTIKA RAI,SAKSHEE VIJAY,Dipti Mishra Sharma
International Conference on Natural Language Processing., ICON, 2017
@inproceedings{bib_Ling_2017, AUTHOR = {VARTIKA RAI, SAKSHEE VIJAY, Dipti Mishra Sharma}, TITLE = {Linguistic approach based Transfer Learning for Sentiment Classification in Hindi}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2017}}
Sentiment analysis in a resource-scarce language is a tedious task. We propose a novel method for transfer learning from a target language to English. Our system doesn't rely on labeled data for the target language but instead links itself onto an already existing and extensively labeled word-level lexical resource in English (ESWN) and a semantic parser. Our proposed system needs no target-language sentiment corpus, and exploits the complex linguistic structure of the target language for sentiment prediction. This cross-lingual approach gives a net accuracy of 83.6%, an improvement of 5.4% over the baseline system.
POS Tagging For Resource Poor Languages Through Feature Projection
PRUTHWIK MISHRA,Vandan Mujadia,Dipti Mishra Sharma
International Conference on Natural Language Processing., ICON, 2017
@inproceedings{bib_POS__2017, AUTHOR = {PRUTHWIK MISHRA, Vandan Mujadia, Dipti Mishra Sharma}, TITLE = {POS Tagging For Resource Poor Languages Through Feature Projection}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2017}}
We present an approach for POS tagging without any labeled data. Our method requires translated sentences from a pair of languages. We used feature transfer from a resource-rich language to resource-poor languages. Across 8 different Indian languages, we achieved encouraging accuracies without any knowledge of the target language and without any human annotation. This will help us in creating annotated corpora for resource-poor languages.
A vis-à-vis evaluation of MT paradigms for linguistically distant languages
AGRAWAL RUCHIT RAJESHKUMAR,JAHFAR ALI P,Dipti Mishra Sharma
International Conference on Natural Language Processing., ICON, 2017
@inproceedings{bib_A_vi_2017, AUTHOR = {AGRAWAL RUCHIT RAJESHKUMAR, JAHFAR ALI P, Dipti Mishra Sharma}, TITLE = {A vis-à-vis evaluation of MT paradigms for linguistically distant languages}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2017}}
Neural Machine Translation is emerging as the de facto standard for Machine Translation across the globe. Statistical Machine Translation has been the state-of-the-art for translation among Indian languages. This paper probes the effectiveness of NMT for Indian languages and compares the strengths and weaknesses of NMT and SMT through a vis-à-vis qualitative estimation on different linguistic parameters. We compare the outputs of both models for English, Malayalam and Hindi, and test them on various linguistic parameters. We conclude that NMT works better in most settings; however, there is still immense scope for improving translation accuracy for Indian languages. We describe the challenges faced, especially when dealing with languages from different language families.
Three-phase training to address data sparsity in Neural Machine Translation
AGRAWAL RUCHIT RAJESHKUMAR,MIHIR SHEKHAR,Dipti Mishra Sharma
International Conference on Natural Language Processing., ICON, 2017
@inproceedings{bib_Thre_2017, AUTHOR = {AGRAWAL RUCHIT RAJESHKUMAR, MIHIR SHEKHAR, Dipti Mishra Sharma}, TITLE = {Three-phase training to address data sparsity in Neural Machine Translation}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2017}}
Data sparsity is a key problem in contemporary neural machine translation (NMT) techniques, especially for resource-scarce language pairs. NMT models, when coupled with large, high-quality parallel corpora, provide promising results and are an emerging alternative to phrase-based Statistical Machine Translation (SMT) systems. A solution to data sparsity can facilitate leveraging NMT models across language pairs, thereby providing high-quality translations despite the lack of large parallel corpora. In this paper, we demonstrate a three-phase integrated approach which combines weakly supervised and semi-supervised learning with NMT techniques to build a robust model using a limited amount of parallel data. We conduct experiments for five language pairs (thereby generating ten systems), and our results show a substantial increase in translation quality over a baseline NMT model trained only on the limited parallel data.
Joining Hands: Exploiting Monolingual Treebanks for Parsing of Code-mixing Data
IRSHAD AHMAD BHAT,RIYAZ AHMAD BHAT,Manish Srivastava,Dipti Mishra Sharma
Conference of the European Chapter of the Association for Computational Linguistics (EACL), EACL, 2017
@inproceedings{bib_Join_2017, AUTHOR = {IRSHAD AHMAD BHAT, RIYAZ AHMAD BHAT, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Joining Hands: Exploiting Monolingual Treebanks for Parsing of Code-mixing Data}, BOOKTITLE = {Conference of the European Chapter of the Association for Computational Linguistics (EACL)}. YEAR = {2017}}
In this paper, we propose efficient and less resource-intensive strategies for parsing of code-mixed data. These strategies are not constrained by in-domain annotations, rather they leverage pre-existing monolingual annotated resources for training. We show that these methods can produce significantly better results as compared to an informed baseline. Due to lack of an evaluation set for code-mixed structures, we also present a data set of 450 Hindi and English code-mixed tweets of Hindi multilingual speakers for evaluation.
Unity in Diversity: A Unified Parsing Strategy for Major Indian Languages
JUHI TANDON,Dipti Mishra Sharma
International Conference on Dependency Linguistics, Depling, 2017
@inproceedings{bib_Unit_2017, AUTHOR = {JUHI TANDON, Dipti Mishra Sharma}, TITLE = {Unity in Diversity: A Unified Parsing Strategy for Major Indian Languages}, BOOKTITLE = {International Conference on Dependency Linguistics}. YEAR = {2017}}
This paper presents our work to apply a nonlinear neural network for parsing five resource-poor Indian languages belonging to two major language families, Indo-Aryan and Dravidian. Bengali and Marathi are Indo-Aryan languages, whereas Kannada, Telugu and Malayalam belong to the Dravidian family. While little work has been done previously on Bengali and Telugu linear transition-based parsing, we present one of the first parsers for Marathi, Kannada and Malayalam. All these Indian languages are free word order and range from moderately to very rich in morphology. Therefore, in this work we propose the usage of linguistically motivated morphological features (suffix and postposition) in the nonlinear framework, to capture the intricacies of both language families. We also capture chunk and gender, number and person information elegantly in this model. We put forward ways to represent these features cost-effectively.
Improving Transition-Based Dependency Parsing of Hindi and Urdu by Modeling Syntactically Relevant Phenomena
RIYAZ AHMAD BHAT,IRSHAD AHMAD BHAT,Dipti Mishra Sharma
ACM Trasactions on Asian and Low Resource Language Information Processing, TALLIP, 2017
@inproceedings{bib_Impr_2017, AUTHOR = {RIYAZ AHMAD BHAT, IRSHAD AHMAD BHAT, Dipti Mishra Sharma}, TITLE = {Improving Transition-Based Dependency Parsing of Hindi and Urdu by Modeling Syntactically Relevant Phenomena}, BOOKTITLE = {ACM Trasactions on Asian and Low Resource Language Information Processing}. YEAR = {2017}}
Hindi and Urdu are spoken primarily in northern India and Pakistan; together, they constitute the third largest language spoken in the world. They are two standardized registers of what has been called the Hindustani language, which belongs to the Indo-Aryan language family. Masica [1993] states that, while they are different languages officially, they are not even different dialects or subdialects in a linguistic sense; rather, they are different literary styles based on the same linguistically defined subdialect.
Deep Neural Network based system for solving Arithmetic Word problems
Purvanshi Mehta,PRUTHWIK MISHRA,ATHAVALE VINAYAK SANJAY,Manish Srivastava,Dipti Mishra Sharma
International Joint Conference on Natural Language Processing, IJCNLP, 2017
@inproceedings{bib_Deep_2017, AUTHOR = {Purvanshi Mehta, PRUTHWIK MISHRA, ATHAVALE VINAYAK SANJAY, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Deep Neural Network based system for solving Arithmetic Word problems}, BOOKTITLE = {International Joint Conference on Natural Language Processing}. YEAR = {2017}}
This paper presents DILTON, a system which solves simple arithmetic word problems. DILTON first predicts the operation to be performed ('-', '+', '*', '/') through a deep neural network based model and then uses it to generate the answer. DILTON divides the question into two parts, the world state and the query, as shown in Figure 1. The world state and the query are processed separately in two different networks, which are finally merged to predict the operation. DILTON learns to predict operations with 8.81% in a corpus of primary school questions. With a simple similarity between the contexts of quantities appearing in the problem and the question text, we are able to identify 92.25% of relevant quantities and solve 81% of the questions. Our code and data are publicly available.
Explicit Argument Identification for Discourse Parsing In Hindi: A Hybrid Pipeline
ROHIT JAIN,Dipti Mishra Sharma
NAACL Student Research Workshop, NAACL-SRW, 2016
@inproceedings{bib_Expl_2016, AUTHOR = {ROHIT JAIN, Dipti Mishra Sharma}, TITLE = {Explicit Argument Identification for Discourse Parsing In Hindi: A Hybrid Pipeline}, BOOKTITLE = {NAACL Student Research Workshop}. YEAR = {2016}}
Shallow discourse parsing enables us to study discourse as a coherent piece of information rather than a sequence of clauses, sentences and paragraphs. In this paper, we identify arguments of explicit discourse relations in Hindi. This is the first such work carried out for Hindi. Building upon previous work on discourse connective identification in Hindi, we propose a hybrid pipeline which makes use of both sub-tree extraction and linear tagging approaches. We report state-of-the-art performance for this task.
Anuvaad Pranaali: A RESTful API for Machine Translation
NEHAL JAGDISH WANI,SHARADA PRASANNA MOHANTY,Venkata Suresh Reddy Purini,Dipti Mishra Sharma
International Conference on Service-Oriented Computing, ICSOC, 2016
@inproceedings{bib_Anuv_2016, AUTHOR = {NEHAL JAGDISH WANI, SHARADA PRASANNA MOHANTY, Venkata Suresh Reddy Purini, Dipti Mishra Sharma}, TITLE = {Anuvaad Pranaali: A RESTful API for Machine Translation}, BOOKTITLE = {International Conference on Service-Oriented Computing}. YEAR = {2016}}
Current web APIs are end-user centric, as they mostly focus on the end results. In this paper, we break this paradigm for one class of scientific workflow problems, machine translation, by designing an API that caters not only to end users but also allows researchers to find bugs in their systems by exposing the ability to programmatically manipulate the results. Moreover, it follows an easy-to-replicate, workflow-based mechanism built on the concept of microservices.
A House United: Bridging the Script and Lexical Barrier between Hindi and Urdu
RIYAZ AHMAD BHAT,IRSHAD AHMAD BHAT,NAMAN JAIN,Dipti Mishra Sharma
International Conference on Computational Linguistics, COLING, 2016
@inproceedings{bib_A_Ho_2016, AUTHOR = {RIYAZ AHMAD BHAT, IRSHAD AHMAD BHAT, NAMAN JAIN, Dipti Mishra Sharma}, TITLE = {A House United: Bridging the Script and Lexical Barrier between Hindi and Urdu}, BOOKTITLE = {International Conference on Computational Linguistics}. YEAR = {2016}}
In Computational Linguistics, Hindi and Urdu are not viewed as a monolithic entity and have received separate attention with respect to their text processing. From part-of-speech tagging to machine translation, models are separately trained for both Hindi and Urdu despite the fact that they represent the same language. The reasons mainly are their divergent literary vocabularies and separate orthographies, and probably also their political status and the social perception that they are two separate languages. In this article, we propose a simple but efficient approach to bridge the lexical and orthographic differences between Hindi and Urdu texts. With respect to text processing, addressing the differences between the Hindi and Urdu texts would be beneficial in the following ways: (a) instead of training separate models, their individual resources can be augmented to train single, unified models for better generalization, and (b) their individual text processing applications can be used interchangeably under varied resource conditions. To remove the script barrier, we learn accurate statistical transliteration models which use sentence-level decoding to resolve word ambiguity. Similarly, we learn cross-register word embeddings from the harmonized Hindi and Urdu corpora to nullify their lexical divergences. As a proof of concept, we evaluate our approach on Hindi and Urdu dependency parsing under two scenarios: (a) resource sharing, and (b) resource augmentation. We demonstrate that a neural network-based dependency parser trained on augmented, harmonized Hindi and Urdu resources performs significantly better than the parsing models trained separately on the individual resources. We also show that we can achieve near state-of-the-art results when the parsers are used interchangeably.
Comparative Error Analysis Of Parser Outputs On Telugu Dependency Treebank
KANNEGANTI SILPA,HIMANI CHAUDHRY,Dipti Mishra Sharma
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2016
@inproceedings{bib_Comp_2016, AUTHOR = {KANNEGANTI SILPA, HIMANI CHAUDHRY, Dipti Mishra Sharma}, TITLE = {Comparative Error Analysis Of Parser Outputs On Telugu Dependency Treebank}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2016}}
We present a comparative error analysis of two parsers, MALT and MST, on Telugu Dependency Treebank data. MALT and MST are currently two of the most dominant data-driven dependency parsers. We discuss the performances of both parsers in relation to the Telugu language. We also talk in detail about both the algorithmic issues of the parsers and the language-specific constraints of Telugu. The purpose is to better understand how to help the parsers deal with complex structures, make sense of implicit language-specific cues and build a more informed treebank.
Construction Grammar Based Annotation Framework for Parsing Tamil
Vigneshwaran Muralidaran,Dipti Mishra Sharma
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2016
@inproceedings{bib_Cons_2016, AUTHOR = {Vigneshwaran Muralidaran, Dipti Mishra Sharma}, TITLE = {Construction Grammar Based Annotation Framework for Parsing Tamil}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2016}}
Syntactic parsing in NLP is the task of working out the grammatical structure of sentences. Some of the purely formal approaches to parsing, such as phrase structure grammar and dependency grammar, have been successfully employed for a variety of languages. While phrase structure based constituent analysis is possible for fixed order languages such as English, dependency analysis between the grammatical units has been suitable for many free word order languages. These approaches rely on identifying the linguistic units based on their formal syntactic properties and establishing the relationships between such units in the form of a tree. Instead, we characterize every morphosyntactic unit as a mapping between form and function on the lines of Construction Grammar, and parsing as the identification of dependency relations between such conceptual units. Our approach to parser annotation shows an average MALT LAS score of 82.21% on a Tamil gold annotated corpus of 935 sentences in a five-fold validation experiment.
Conversion from Paninian Karakas to Universal Dependencies for Hindi Dependency Treebank
JUHI TANDON,HIMANI CHAUDHRY,RIYAZ AHMAD BHAT,Dipti Mishra Sharma
Linguistic Annotation Workshop, LAW, 2016
@inproceedings{bib_Conv_2016, AUTHOR = {JUHI TANDON, HIMANI CHAUDHRY, RIYAZ AHMAD BHAT, Dipti Mishra Sharma}, TITLE = {Conversion from Paninian Karakas to Universal Dependencies for Hindi Dependency Treebank}, BOOKTITLE = {Linguistic Annotation Workshop}. YEAR = {2016}}
Significance of an Accurate Sandhi-Splitter in Shallow Parsing of Dravidian Languages
DEVADATH V V,Dipti Mishra Sharma
Student Research Workshop, SRW, 2016
@inproceedings{bib_Sign_2016, AUTHOR = {DEVADATH V V, Dipti Mishra Sharma}, TITLE = {Significance of an Accurate Sandhi-Splitter in Shallow Parsing of Dravidian Languages}, BOOKTITLE = {Student Research Workshop}. YEAR = {2016}}
This paper evaluates the challenges involved in shallow parsing of Dravidian languages, which are highly agglutinative and morphologically rich. Text processing tasks in these languages are not trivial because multiple words concatenate to form a single string with morpho-phonemic changes at the point of concatenation. This phenomenon, known as Sandhi, in turn complicates individual word identification. Shallow parsing is the task of identifying correlated groups of words given a raw sentence. The current work is an attempt to study the effect of Sandhi in building shallow parsers for Dravidian languages by evaluating its effect on Malayalam, one of the main languages of the Dravidian family. We provide an in-depth analysis of the effect of Sandhi in developing a robust shallow parser pipeline, with experimental results emphasizing how sensitive the individual components of a shallow parser are to the accuracy of a sandhi splitter. Our work can serve as a guiding light for building robust text processing systems in Dravidian languages.
Analyzing English Phrases from Paninian Perspective
Akshar Bharati,SUKHADA,Dipti Mishra Sharma
Research in Computing Science, RCS, 2016
@inproceedings{bib_Anal_2016, AUTHOR = {Akshar Bharati, SUKHADA, Dipti Mishra Sharma}, TITLE = {Analyzing English Phrases from Paninian Perspective}, BOOKTITLE = {Research in Computing Science}. YEAR = {2016}}
This paper explores Pāṇinian Grammar (PG) as an information processing device in terms of 'how', 'how much' and 'where' languages encode information. PG is based on a morphologically rich language, Sanskrit. We apply PG to English and see how the Pāṇinian perspective would deal with it from the information-theoretical point of view, and assess its effectiveness in machine translation. We analyze English phrases defining sup (nominal inflections) and tiṅ (finite verb inflections) and compare them with the notion of pada (an inflected word form) and samasta-pada (compound) in Sanskrit. Sanskrit encodes relations between nouns and adjectives, and nouns in apposition, through agreement between gender, number and case markers, whereas English encodes them through positions. As a result, constituents are formed. It appears that an English phrase contains more than one pada and hence cannot be similar to a pada. However, we show the linguistic similarities between a pada, a samasta-pada and a 'phrase'.
A semi-supervised associative classification method for POS tagging
PRATIBHA RANI,Vikram Pudi,Dipti Mishra Sharma
International Journal of Data Science and Analytics, IJDSA, 2016
@inproceedings{bib_A_se_2016, AUTHOR = {PRATIBHA RANI, Vikram Pudi, Dipti Mishra Sharma}, TITLE = {A semi-supervised associative classification method for POS tagging}, BOOKTITLE = {International Journal of Data Science and Analytics}. YEAR = {2016}}
We present here a data mining approach for part-of-speech (POS) tagging, an important natural language processing (NLP) task, which is a classification problem. We propose a semi-supervised associative classification method for POS tagging. Existing methods for building POS taggers require extensive domain and linguistic knowledge and resources. Our method uses a combination of a small POS tagged corpus and untagged text data as training data to build the classifier model using association rules. Our tagger also works well with very little training data. The use of semi-supervised learning provides the advantage of not requiring a large high-quality annotated corpus. These properties make it especially suitable for resource-poor languages. Our experiments on various resource-rich, resource-moderate and resource-poor languages show good performance without using any language-specific linguistic information. We note that inclusion of such features in our method may further improve the performance. Results also show that for smaller training data sizes our tagger performs better than the state-of-the-art conditional random field (CRF) tagger using the same features as our tagger.
Pronominal Reference Type Identification and Event Anaphora Resolution for Hindi
Vandan Mujadia,Palash Gupta,Dipti Mishra Sharma
International Journal of Computational Linguistics and Applications, IJCLA, 2016
@inproceedings{bib_Pron_2016, AUTHOR = {Vandan Mujadia, Palash Gupta, Dipti Mishra Sharma}, TITLE = {Pronominal Reference Type Identification and Event Anaphora Resolution for Hindi}, BOOKTITLE = {International Journal of Computational Linguistics and Applications}. YEAR = {2016}}
In this paper, we present hybrid approaches for pronominal reference type (abstract or concrete) identification and event anaphora resolution for Hindi. Pronominal reference type identification is an important part of any anaphora resolution system, as it helps the anaphora resolver in optimal feature selection based on pronominal reference types. We use language-specific rules and features in a set of classifiers (ensemble learning) for pronominal type identification. We discuss event-referring anaphors (pronouns) and their resolution using Paninian dependency grammar, language syntax, proximity of events, etc. We achieved around 90% accuracy in pronominal reference type identification and around 71% F-score in event anaphora resolution on the Hindi dependency treebank corpus.
Kathaa: A visual programming framework for nlp applications
SHARADA PRASANNA MOHANTY,NEHAL JAGDISH WANI,Manish Srivastava,Dipti Mishra Sharma
Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 2016
@inproceedings{bib_Kath_2016, AUTHOR = {SHARADA PRASANNA MOHANTY, NEHAL JAGDISH WANI, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Kathaa: A visual programming framework for nlp applications}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics}. YEAR = {2016}}
In this paper, we present Kathaa, an open source web based Visual Programming Framework for NLP applications. It supports design, execution and analysis of complex NLP systems by choosing and visually connecting NLP modules from an already available and easily extensible Module library. It models NLP systems as a Directed Acyclic Graph of optionally parallelized information flow, and lets the user choose and use available modules in their NLP applications irrespective of their technical proficiency. Kathaa exposes a precise Module definition API to allow easy integration of external NLP components (along with their associated services as docker containers), and it allows everyone to publish their services in a standardized format for others to use out of the box.
Kathaa: NLP Systems as Edge-Labeled Directed Acyclic MultiGraphs
SHARADA PRASANNA MOHANTY,NEHAL JAGDISH WANI,Manish Srivastava,Dipti Mishra Sharma
International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infr, OIAF4HLT | WS, 2016
@inproceedings{bib_Kath_2016, AUTHOR = {SHARADA PRASANNA MOHANTY, NEHAL JAGDISH WANI, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Kathaa: NLP Systems as Edge-Labeled Directed Acyclic MultiGraphs}, BOOKTITLE = {International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infr}. YEAR = {2016}}
We present Kathaa, an Open Source web-based Visual Programming Framework for Natural Language Processing (NLP) Systems. Kathaa supports the design, execution and analysis of complex NLP systems by visually connecting NLP components from an easily extensible Module Library. It models NLP systems as an edge-labeled Directed Acyclic MultiGraph, and lets the user use publicly co-created modules in their own NLP applications irrespective of their technical proficiency in Natural Language Processing. Kathaa exposes an intuitive web based Interface for the users to interact with and modify complex NLP Systems, and a precise Module definition API to allow easy integration of new state-of-the-art NLP components. Kathaa enables researchers to publish their services in a standardized format to enable the masses to use their services out of the box. The vision of this work is to pave the way for a system like Kathaa to be the Lego blocks of NLP Research and Applications. As a practical use case, we use Kathaa to visually implement the Sampark Hindi-Panjabi Machine Translation Pipeline and the Sampark Hindi-Urdu Machine Translation Pipeline, demonstrating that Kathaa can handle truly complex NLP systems while still being intuitive for the end user.
Shallow parsing pipeline for hindi-english code-mixed social media text
ARNAV SHARMA,SAKSHI GUPTA,RAVEESH MOTLANI,PIYUSH BANSAL,Manish Srivastava,Radhika Mamidi,Dipti Mishra Sharma
Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 2016
@inproceedings{bib_Shal_2016, AUTHOR = {ARNAV SHARMA, SAKSHI GUPTA, RAVEESH MOTLANI, PIYUSH BANSAL, Manish Srivastava, Radhika Mamidi, Dipti Mishra Sharma}, TITLE = {Shallow parsing pipeline for hindi-english code-mixed social media text}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics}. YEAR = {2016}}
In this study, the problem of shallow parsing of Hindi-English code-mixed social media text (CSMT) has been addressed. We have annotated the data and developed a language identifier, a normalizer, a part-of-speech tagger and a shallow parser. To the best of our knowledge, we are the first to attempt shallow parsing on CSMT. The pipeline developed has been made available to the research community with the goal of enabling better text analysis of Hindi-English CSMT. The pipeline is accessible at this http URL.
Non-decreasing sub-modular function for comprehensible summarization
LITTON J KURISINKEL,PRUTHWIK MISHRA,VIGNESHWARAN M,Vasudeva Varma Kalidindi,Dipti Mishra Sharma
Conference of the North American Chapter of the Association for Computational Linguistics Workshops, NAACL-W, 2016
@inproceedings{bib_Non-_2016, AUTHOR = {LITTON J KURISINKEL, PRUTHWIK MISHRA, VIGNESHWARAN M, Vasudeva Varma Kalidindi, Dipti Mishra Sharma}, TITLE = {Non-decreasing sub-modular function for comprehensible summarization}, BOOKTITLE = {Conference of the North American Chapter of the Association for Computational Linguistics Workshops}. YEAR = {2016}}
Extractive summarization techniques typically aim to maximize the information coverage of the summary with respect to the original corpus and report accuracies in ROUGE scores. Automated text summarization techniques should consider the dimensions of comprehensibility, coherence and readability. In the current work, we identify the discourse structure which provides the context for the creation of a sentence. We leverage the information from the structure to frame a monotone (non-decreasing) sub-modular scoring function for generating comprehensible summaries. Our approach improves the overall quality of comprehensibility of the summary in terms of human evaluation and gives sufficient content coverage with comparable ROUGE score. We also formulate a metric to measure summary comprehensibility in terms of Contextual Independence of a sentence. The metric is shown to be representative of human judgement of text comprehensibility.
Developing Part-of-Speech Tagger for a Resource Poor Language :Sindhi
RAVEESH MOTLANI,Harsh Lalwani,Manish Srivastava,Dipti Mishra Sharma
Conference on Language and Technology,, CLT, 2015
@inproceedings{bib_Deve_2015, AUTHOR = {RAVEESH MOTLANI, Harsh Lalwani, Manish Srivastava, Dipti Mishra Sharma}, TITLE = {Developing Part-of-Speech Tagger for a Resource Poor Language :Sindhi}, BOOKTITLE = {Conference on Language and Technology,}. YEAR = {2015}}
Sindhi is an Indo-Aryan language spoken by more than 58 million speakers around the world. It is currently a resource-poor language, hampered by its literature being written in multiple scripts. Though the language is widely spoken, primarily across two countries, the written form is not standardized. In this paper, we seek to develop resources for basic language processing for the Sindhi language in one of its preferred scripts (Devanagari), because a language that seeks to survive in the modern information society requires language technology products. This paper presents our work on building a stochastic part-of-speech tagger for Sindhi-Devanagari using conditional random fields with linguistically motivated features. The paper also discusses the steps taken to construct a part-of-speech annotated corpus for Sindhi in the Devanagari script. We have also explained in detail the features that were used for training the tagger, which resulted in a part-of-speech tagger with an average accuracy nearing 92%.
Applying Sanskrit Concepts for Reordering in MT
Akshar Bharati,SUKHADA,Prajna Jha,Soma Paul,Dipti Mishra Sharma
International Conference on Natural Language Processing., ICON, 2015
@inproceedings{bib_Appl_2015, AUTHOR = {Akshar Bharati, SUKHADA, Prajna Jha, Soma Paul, Dipti Mishra Sharma}, TITLE = {Applying Sanskrit Concepts for Reordering in MT}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2015}}
This paper presents a rule-based reordering approach for English-Hindi machine translation. We have used the concept of pada, from Pāṇinian Grammar, to frame the reordering rules. A pada is a word form which is ready to participate in a sentence. The rules are generic enough to apply to any English-Indian language pair. We tested the rules on the English-Hindi language pair and obtained a better comprehensibility score compared to Google Translate on the same test set. In assessing the effectiveness of the rules on padas, which are analogous to minimal phrases in English, we achieved up to 93% accuracy on the test data.
Paninian Grammar Based Hindi Dialogue Anaphora Resolution
Vandan Mujadia,DARSHAN AGARWAL,Radhika Mamidi,Dipti Mishra Sharma
International Conference on Asian Language Processing, IALP, 2015
@inproceedings{bib_Pani_2015, AUTHOR = {Vandan Mujadia, DARSHAN AGARWAL, Radhika Mamidi, Dipti Mishra Sharma}, TITLE = {Paninian Grammar Based Hindi Dialogue Anaphora Resolution}, BOOKTITLE = {International Conference on Asian Language Processing}. YEAR = {2015}}
In this paper, we present a Paninian grammar based heuristic model to resolve entity-pronoun references in Hindi dialogue. We explore the use of Paninian dependency structures as a source of syntactico-semantic information. Our experiments illustrate that the use of dependency and dialogue structures helps to resolve specific types of references. We also show that named entities, discourse information like subtopic boundaries, and animacy features increase the overall resolution accuracy to 64% for user-user interaction data and 59% for play-story corpora.
Readable and Coherent MultiDocument Summarization.
LITTON J KURISINKEL,VIGNESHWARAN M,Vasudeva Varma Kalidindi,Dipti Mishra Sharma
International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, 2015
@inproceedings{bib_Read_2015, AUTHOR = {LITTON J KURISINKEL, VIGNESHWARAN M, Vasudeva Varma Kalidindi, Dipti Mishra Sharma}, TITLE = {Readable and Coherent MultiDocument Summarization.}, BOOKTITLE = {International Conference on Intelligent Text Processing and Computational Linguistics}. YEAR = {2015}}
Extractive summarization is the process of precisely choosing a set of sentences from a corpus which can serve as a representative of the original corpus in a limited space. In addition to exhibiting good content coverage, the final summary should be readable as well as structurally and topically coherent. In this paper we present a holistic, multi-document summarization approach which takes care of content coverage, sentence ordering, maintenance of topical coherence, topical order and inter-sentence structural relationships. To achieve this we have introduced the novel concept of a Local Coherent Unit (LCU). Our results are comparable with peer systems for content coverage and sentence ordering, measured in terms of ROUGE and τ score respectively. The human evaluation preference for readability and coherence of the summary is significantly better for our approach vis-à-vis other approaches. The approach is scalable to larger real-time corpora as well.
Exploring the effects of sentence simplification on Hindi to English machine translation system
KSHITIJ MISHRA,ANKUSH SONI,Rahul Sharma,Dipti Mishra Sharma
Conference of the Association for Computational Linguistics Workshops, ACL-W, 2014
@inproceedings{bib_Expl_2014, AUTHOR = {KSHITIJ MISHRA, ANKUSH SONI, Rahul Sharma, Dipti Mishra Sharma}, TITLE = {Exploring the effects of sentence simplification on Hindi to English machine translation system}, BOOKTITLE = {Conference of the Association for Computational Linguistics Workshops}. YEAR = {2014}}
Even though a lot of research has already been done on machine translation, translating complex sentences has been a stumbling block in the process. To improve the performance of machine translation on complex sentences, simplifying the sentences becomes imperative. In this paper, we present a rule-based approach to address this problem by simplifying complex sentences in Hindi into multiple simple sentences. The sentence is split using clause boundaries and dependency parsing, which identifies different arguments of verbs, thus changing the grammatical structure in a way that the semantic information of the original sentence stays preserved.
Hindi Word Sketches
ERAGANI ANIL KRISHNA,VARUN K,Dipti Mishra Sharma, Siva Reddy,Adam Kilgarriff
International Conference on Natural Language Processing., ICON, 2014
@inproceedings{bib_Hind_2014, AUTHOR = {ERAGANI ANIL KRISHNA, VARUN K, Dipti Mishra Sharma, Siva Reddy, Adam Kilgarriff}, TITLE = {Hindi Word Sketches}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2014}}
Word sketches are one-page automatic, corpus-based summaries of a word's grammatical and collocational behaviour. These are widely used for studying a language and in lexicography. Sketch Engine is a leading corpus tool which takes as input a corpus and generates word sketches for the words of that language. It also generates a thesaurus and 'sketch differences', which specify similarities and differences between near-synonyms. In this paper, we present the functionalities of Sketch Engine for Hindi. We collected HindiWaC, a web crawled corpus for Hindi with 240 million words. We lemmatized and POS tagged the corpus and then loaded it into Sketch Engine.
A sandhi splitter for malayalam
DEVADATH V V,LITTON J KURISINKEL,Dipti Mishra Sharma,Vasudeva Varma Kalidindi
International Conference on Natural Language Processing., ICON, 2014
@inproceedings{bib_A_sa_2014, AUTHOR = {DEVADATH V V, LITTON J KURISINKEL, Dipti Mishra Sharma, Vasudeva Varma Kalidindi}, TITLE = {A sandhi splitter for malayalam}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2014}}
Sandhi splitting is the primary task for computational processing of text in Sanskrit and Dravidian languages. In these languages, words can join together with morpho-phonemic changes at the point of joining. This phenomenon is known as Sandhi. A sandhi splitter splits a string of conjoined words into individual words. Accurate sandhi splitting is crucial for text processing tasks such as POS tagging, topic modelling and document indexing. We tried different approaches to address the challenges of sandhi splitting in Malayalam, and finally settled on exploiting the phonological changes that take place in the words while joining. This resulted in a hybrid method which statistically identifies the split points and splits using predefined character-level linguistic rules. Currently, our system gives an accuracy of 91.1%.
Identification of Karaka relations in an English sentence
GORTHI SAI KIRAN,ASHISH PALAKURTHI,Radhika Mamidi,Dipti Mishra Sharma
International Conference on Natural Language Processing., ICON, 2014
@inproceedings{bib_Iden_2014, AUTHOR = {GORTHI SAI KIRAN, ASHISH PALAKURTHI, Radhika Mamidi, Dipti Mishra Sharma}, TITLE = {Identification of Karaka relations in an English sentence}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2014}}
In this paper we explain the identification of karaka relations in an English sentence. We explain the genesis of the problem and present two different approaches, rule-based and statistical. We briefly describe the rule-based approach and focus more on the statistical one. We process a sentence through various stages and extract features at each stage. We train our data and identify karaka relations using Support Vector Machines (SVM). We also explain the impact of our work on Natural Language Interfaces for Database systems.
Animacy Annotation in the Hindi Treebank
Itisree Jena,Riyaz Ahmad Bhat,Sambhav Jain,Dipti Mishra Sharma
Linguistic Annotation Workshop, LAW, 2013
@inproceedings{bib_Anim_2013, AUTHOR = {Itisree Jena, Riyaz Ahmad Bhat, Sambhav Jain, Dipti Mishra Sharma}, TITLE = {Animacy Annotation in the Hindi Treebank}, BOOKTITLE = {Linguistic Annotation Workshop}. YEAR = {2013}}
In this paper, we discuss our efforts to annotate nominals in the Hindi Treebank with the semantic property of animacy. Although the treebank already encodes lexical information at a number of levels such as morph and part of speech, the addition of animacy information seems promising given its relevance to varied linguistic phenomena. The suggestion is based on the theoretical and computational analysis of the property of animacy in the context of anaphora resolution, syntactic parsing, verb classification and argument differentiation.