Psycholinguistic Features Predict Word Duration in Hindi Read Aloud Speech
Rajakrishnan P Rajkumar, Sneha Raman, Aadya Ranjan, Mildred Pereira, Nagesh Nayak, Preeti Rao
International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2025
@inproceedings{bib_Psyc_2025, AUTHOR = {Rajakrishnan P Rajkumar, Sneha Raman, Aadya Ranjan, Mildred Pereira, Nagesh Nayak, Preeti Rao}, TITLE = {Psycholinguistic Features Predict Word Duration in Hindi Read Aloud Speech}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}, YEAR = {2025}}
Reliable assessment of oral reading fluency (ORF) is of great importance in foundational literacy missions globally. For the design of level-appropriate testing passages, text difficulty has traditionally been based on coarse-grained measures of readability like the Flesch–Kincaid score. We present a novel study in which we deploy psycholinguistic measures of reading difficulty from Natural Language Processing to predict the duration of words in Hindi read-aloud speech. We test the hypothesis that expectation-based measures of linguistic complexity are significant predictors of word duration in Hindi read-aloud speech. We validate this hypothesis by estimating surprisal measures inspired by the Surprisal Theory of sentence comprehension, and we introduce a novel measure of orthographic complexity to model the intricacies of the Hindi script. Cognitive modelling experiments were conducted on a dataset of six Hindi short stories read aloud by five expert readers, with two measures of word duration. Our results show that both the surprisal measures and the orthographic complexity measure are significant predictors of word duration. In contrast to long words, short words show decreasing duration with increasing orthographic complexity.
Variation in word duration between individual speakers is very low; the variance in the data is driven chiefly by properties of the words in the text. Finally, we reflect on the implications of our work for cognitive models of language production and for ORF assessment.
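As a rough illustration of the kind of expectation-based measure involved (not the authors' implementation), lexical surprisal is the negative log-probability of a word given its context. A minimal sketch using an add-alpha-smoothed bigram model, where the function name, smoothing scheme, and sentence-start token are all illustrative assumptions:

```python
import math
from collections import Counter

def bigram_surprisals(sentence, corpus, alpha=1.0):
    """Surprisal (in bits) of each word in `sentence` under an
    add-alpha smoothed bigram model estimated from `corpus`
    (a list of tokenized sentences). Illustrative sketch only."""
    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter(p for sent in corpus for p in zip(sent, sent[1:]))
    vocab = len(unigrams)
    surprisals = []
    # "<s>" is a sentence-start placeholder for the first word's context.
    for prev, word in zip(["<s>"] + sentence, sentence):
        p = (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab)
        surprisals.append(-math.log2(p))
    return surprisals
```

In a setting like the paper's, per-word surprisal values of this kind would then enter a regression predicting word duration.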
DiscoGraMS: Enhancing Movie Screen-Play Summarization using Movie Character-Aware Discourse Graph
Maitreya Prafulla Chitale, Uday Bindal, Rajakrishnan P Rajkumar, Rahul Mishra
North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, 2025
@inproceedings{bib_Disc_2025, AUTHOR = {Maitreya Prafulla Chitale, Uday Bindal, Rajakrishnan P Rajkumar, Rahul Mishra}, TITLE = {DiscoGraMS: Enhancing Movie Screen-Play Summarization using Movie Character-Aware Discourse Graph}, BOOKTITLE = {North American Chapter of the Association for Computational Linguistics: Human Language Technologies}, YEAR = {2025}}
Summarizing movie screenplays presents a unique set of challenges compared to standard document summarization. Screenplays are not only lengthy, but also feature a complex interplay of characters, dialogues, and scenes, with numerous direct and subtle relationships and contextual nuances that are difficult for machine learning models to accurately capture and comprehend. Recent attempts at screenplay summarization focus on fine-tuning transformer-based pre-trained models, but these models often fall short in capturing long-term dependencies and latent relationships, and frequently encounter the "lost in the middle" issue. To address these challenges, we introduce DiscoGraMS, a novel resource that represents movie scripts as a movie character-aware discourse graph (CaD Graph). This approach is well-suited for various downstream tasks, such as summarization, question-answering, and salience detection. The model aims to preserve all salient information, offering a more comprehensive and faithful representation of the screenplay's content. We further explore a baseline method that combines the CaD Graph with the corresponding movie script through a late fusion of graph and text modalities, and we present promising initial results. We have made our code and dataset publicly available.
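As a loose structural sketch only (the paper's CaD Graph encodes discourse structure that is not reproduced here), a character-aware graph over a screenplay might link characters to the scenes in which they speak and to each other by co-occurrence; every name and field below is a hypothetical placeholder:

```python
from collections import defaultdict

def build_character_scene_graph(scenes):
    """Build a simple bipartite character-scene edge list plus
    weighted character co-occurrence edges. `scenes` is a list of
    dicts like {"id": ..., "speakers": [...]}. Hypothetical sketch;
    the actual CaD Graph is considerably richer."""
    cooccur = defaultdict(int)
    scene_edges = []
    for scene in scenes:
        speakers = scene["speakers"]
        # Character-scene membership edges.
        for ch in speakers:
            scene_edges.append((ch, scene["id"]))
        # Character-character co-occurrence edges, counted per scene.
        for i, a in enumerate(speakers):
            for b in speakers[i + 1:]:
                cooccur[tuple(sorted((a, b)))] += 1
    return scene_edges, dict(cooccur)
```

A graph of this shape could then be fed to a graph encoder and late-fused with a text encoder's representation of the script, in the spirit of the baseline described above.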
Interference predicts locality: Evidence from an SOV language
Sidharth Ranjan, Sumeet Agarwal, Rajakrishnan P Rajkumar
Society for Computation in Linguistics, SCLG, 2024
@inproceedings{bib_Inte_2024, AUTHOR = {Sidharth Ranjan, Sumeet Agarwal, Rajakrishnan P Rajkumar}, TITLE = {Interference predicts locality: Evidence from an SOV language}, BOOKTITLE = {Society for Computation in Linguistics}, YEAR = {2024}}
LOCALITY and INTERFERENCE are two mechanisms posited to drive sentence comprehension. However, the relationship between them remains unclear: are they alternative explanations, or do they operate independently? To answer this question, we test the hypothesis that in Hindi, interference effects (measured by semantic similarity and case markers) significantly predict locality effects (modelled using dependency length, which quantifies the distance between syntactic heads and their dependents) within a sentence, while controlling for expectation-based measures and discourse givenness. Using data from the Hindi-Urdu Treebank corpus (HUTB), we validate the stated hypothesis. We demonstrate that sentences with longer dependency lengths consistently have semantically similar preverbal dependents, more case markers, greater syntactic surprisal, and violations of intra-sentential givenness considerations. Overall, our findings point towards the conclusion that locality effects are reducible to broader memory interference effects rather than being distinct manifestations of locality in syntax. Finally, we discuss the implications of our findings for theories of interference in comprehension.
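For concreteness, dependency length in such studies is typically the linear distance between a syntactic head and its dependent, summed over a sentence's arcs. A minimal sketch over a head-index annotation (the 0-based indexing convention and `None`-for-root encoding are assumptions of this sketch, not HUTB's actual format):

```python
def total_dependency_length(heads):
    """Sum of |head - dependent| linear distances for one sentence.
    `heads[i]` is the 0-based position of word i's syntactic head,
    or None if word i is the root. Illustrative sketch only."""
    return sum(abs(h - i) for i, h in enumerate(heads) if h is not None)
```

For a three-word sentence whose first and third words both depend on the second, the total dependency length is 1 + 1 = 2; longer totals indicate heads and dependents separated by more intervening material.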
A Systematic Exploration of Linguistic Phenomena in Spoken Hindi Resource Creation and Hypothesis Testing
Aadya Ranjan, Sidharth Ranjan, Rajakrishnan P Rajkumar
International Conference on Natural Language Processing, ICON, 2024
@inproceedings{bib_A_Sy_2024, AUTHOR = {Aadya Ranjan, Sidharth Ranjan, Rajakrishnan P Rajkumar}, TITLE = {A Systematic Exploration of Linguistic Phenomena in Spoken Hindi Resource Creation and Hypothesis Testing}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2024}}
This paper presents a meticulous and well-structured approach to annotating a corpus of Hindi spoken data. We deployed four annotators to augment the spoken section of the EMILLE Hindi corpus by marking the various linguistic phenomena observed in spoken data. We then analyzed various phonological (sound deletion), morphological (code-mixing and reduplication) and syntactic phenomena (case markers and ambiguity) not attested in written data. Code mixing and switching constitute the majority of the phenomena we annotated, followed by orthographic errors related to symbols in the Devanagari script. In terms of divergences from the written form of Hindi, case marker usage, missing auxiliary verbs and agreement patterns are markedly distinct in spoken Hindi. The annotators also assigned a quality rating to each sentence in the corpus. Our analysis of the quality ratings revealed that most sentences in the spoken data corpus are of moderate to high quality. Female speakers produced a greater percentage of high-quality sentences than their male counterparts. While previous efforts in corpus annotation have largely focused on creating resources for engineering applications, we illustrate the utility of our dataset for scientific hypothesis testing. Inspired by the Surprisal Theory of language comprehension (Hale, 2001; Levy, 2008), we validate the hypothesis that sentences with high lexical surprisal values are rated low in quality by native speakers, even when controlling for sentence length and word frequencies.
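As a toy illustration of what "controlling for" a covariate means here (not the paper's actual statistical model, which would typically be a regression with multiple controls), one can residualize both surprisal and quality on sentence length and correlate the residuals; all function names are illustrative:

```python
def simple_ols(x, y):
    """Slope and intercept of y ~ x by ordinary least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, my - slope * mx

def partial_corr(surprisal, quality, length):
    """Correlation of surprisal and quality after residualizing both
    on sentence length: a crude stand-in for regression-based control.
    Illustrative sketch only."""
    def residuals(y):
        b, a = simple_ols(length, y)
        return [yi - (a + b * li) for yi, li in zip(y, length)]
    rs, rq = residuals(surprisal), residuals(quality)
    n = len(rs)
    ms, mq = sum(rs) / n, sum(rq) / n
    num = sum((s - ms) * (q - mq) for s, q in zip(rs, rq))
    den = (sum((s - ms) ** 2 for s in rs)
           * sum((q - mq) ** 2 for q in rq)) ** 0.5
    return num / den
```

Under the paper's hypothesis, the residual correlation between surprisal and quality rating would be reliably negative even after length (and, in the full model, word frequency) is partialled out.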