Psycholinguistic Features Predict Word Duration in Hindi Read Aloud Speech
Rajakrishnan P Rajkumar, Sneha Raman, Aadya Ranjan, Mildred Pereira, Nagesh Nayak, Preeti Rao
International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2025
@inproceedings{bib_Psyc_2025, AUTHOR = {Rajakrishnan P Rajkumar, Sneha Raman, Aadya Ranjan, Mildred Pereira, Nagesh Nayak, Preeti Rao}, TITLE = {Psycholinguistic Features Predict Word Duration in Hindi Read Aloud Speech}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}, YEAR = {2025}}
Reliable assessment of oral reading fluency (ORF) is of great importance in foundational literacy missions globally. For the design of level-appropriate testing passages, text difficulty has traditionally been based on coarse-grained measures of readability like the Flesch–Kincaid score. We present a novel study in which we deploy psycholinguistic measures of reading difficulty from Natural Language Processing to predict the duration of words in Hindi read-aloud speech. We test the hypothesis that expectation-based measures of linguistic complexity are significant predictors of word duration in Hindi read-aloud speech. We validate this hypothesis by estimating surprisal measures inspired by the Surprisal Theory of sentence comprehension, and we introduce a novel measure of orthographic complexity to model the intricacies of the Hindi script. Cognitive modelling experiments were conducted on a dataset of six Hindi short stories read aloud by five expert readers, with two measures of word duration. Our results show that both the surprisal measures and the orthographic complexity measure are significant predictors of word duration. In contrast to long words, short words show decreasing duration with increasing orthographic complexity.
Variation in word duration between individual speakers is very low; the variance in the data is driven chiefly by properties of the words in the text. Finally, we reflect on the implications of our work for cognitive models of language production and for ORF assessment.
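As a rough illustration of the kind of expectation-based measure involved (not the authors' implementation), lexical surprisal is the negative log-probability of a word given its context. A minimal sketch using an add-alpha-smoothed bigram model, where the function name, smoothing scheme, and sentence-start token are all illustrative assumptions:

```python
import math
from collections import Counter

def bigram_surprisals(sentence, corpus, alpha=1.0):
    """Surprisal (in bits) of each word in `sentence` under an
    add-alpha smoothed bigram model estimated from `corpus`
    (a list of tokenized sentences). Illustrative sketch only."""
    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter(p for sent in corpus for p in zip(sent, sent[1:]))
    vocab = len(unigrams)
    surprisals = []
    # "<s>" is a sentence-start placeholder for the first word's context.
    for prev, word in zip(["<s>"] + sentence, sentence):
        p = (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab)
        surprisals.append(-math.log2(p))
    return surprisals
```

In a setting like the paper's, per-word surprisal values of this kind would then enter a regression predicting word duration.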
DiscoGraMS: Enhancing Movie Screen-Play Summarization using Movie Character-Aware Discourse Graph
Maitreya Prafulla Chitale, Uday Bindal, Rajakrishnan P Rajkumar, Rahul Mishra
North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, 2025
@inproceedings{bib_Disc_2025, AUTHOR = {Maitreya Prafulla Chitale, Uday Bindal, Rajakrishnan P Rajkumar, Rahul Mishra}, TITLE = {DiscoGraMS: Enhancing Movie Screen-Play Summarization using Movie Character-Aware Discourse Graph}, BOOKTITLE = {North American Chapter of the Association for Computational Linguistics: Human Language Technologies}, YEAR = {2025}}
Summarizing movie screenplays presents a unique set of challenges compared to standard document summarization. Screenplays are not only lengthy, but also feature a complex interplay of characters, dialogues, and scenes, with numerous direct and subtle relationships and contextual nuances that are difficult for machine learning models to accurately capture and comprehend. Recent attempts at screenplay summarization focus on fine-tuning transformer-based pre-trained models, but these models often fall short in capturing long-term dependencies and latent relationships, and frequently encounter the "lost in the middle" issue. To address these challenges, we introduce DiscoGraMS, a novel resource that represents movie scripts as a movie character-aware discourse graph (CaD Graph). This approach is well-suited for various downstream tasks, such as summarization, question-answering, and salience detection. The model aims to preserve all salient information, offering a more comprehensive and faithful representation of the screenplay's content. We further explore a baseline method that combines the CaD Graph with the corresponding movie script through a late fusion of graph and text modalities, and we present promising initial results. We have made our code and dataset publicly available.
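As a loose structural sketch only (the paper's CaD Graph encodes discourse structure that is not reproduced here), a character-aware graph over a screenplay might link characters to the scenes in which they speak and to each other by co-occurrence; every name and field below is a hypothetical placeholder:

```python
from collections import defaultdict

def build_character_scene_graph(scenes):
    """Build a simple bipartite character-scene edge list plus
    weighted character co-occurrence edges. `scenes` is a list of
    dicts like {"id": ..., "speakers": [...]}. Hypothetical sketch;
    the actual CaD Graph is considerably richer."""
    cooccur = defaultdict(int)
    scene_edges = []
    for scene in scenes:
        speakers = scene["speakers"]
        # Character-scene membership edges.
        for ch in speakers:
            scene_edges.append((ch, scene["id"]))
        # Character-character co-occurrence edges, counted per scene.
        for i, a in enumerate(speakers):
            for b in speakers[i + 1:]:
                cooccur[tuple(sorted((a, b)))] += 1
    return scene_edges, dict(cooccur)
```

A graph of this shape could then be fed to a graph encoder and late-fused with a text encoder's representation of the script, in the spirit of the baseline described above.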
Interference predicts locality: Evidence from an SOV language
Sidharth Ranjan, Sumeet Agarwal, Rajakrishnan P Rajkumar
Society for Computation in Linguistics, SCLG, 2024
@inproceedings{bib_Inte_2024, AUTHOR = {Sidharth Ranjan, Sumeet Agarwal, Rajakrishnan P Rajkumar}, TITLE = {Interference predicts locality: Evidence from an SOV language}, BOOKTITLE = {Society for Computation in Linguistics}, YEAR = {2024}}
LOCALITY and INTERFERENCE are two mechanisms posited to drive sentence comprehension. However, the relationship between them remains unclear: are they alternative explanations, or do they operate independently? To answer this question, we test the hypothesis that in Hindi, interference effects (measured by semantic similarity and case markers) significantly predict locality effects (modelled using dependency length, which quantifies the distance between syntactic heads and their dependents) within a sentence, while controlling for expectation-based measures and discourse givenness. Using data from the Hindi-Urdu Treebank corpus (HUTB), we validate the stated hypothesis. We demonstrate that sentences with longer dependency lengths consistently have semantically similar preverbal dependents, more case markers, greater syntactic surprisal, and violations of intra-sentential givenness considerations. Overall, our findings point towards the conclusion that locality effects are reducible to broader memory interference effects rather than being distinct manifestations of locality in syntax. Finally, we discuss the implications of our findings for theories of interference in comprehension.
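For concreteness, dependency length in such studies is typically the linear distance between a syntactic head and its dependent, summed over a sentence's arcs. A minimal sketch over a head-index annotation (the 0-based indexing convention and `None`-for-root encoding are assumptions of this sketch, not HUTB's actual format):

```python
def total_dependency_length(heads):
    """Sum of |head - dependent| linear distances for one sentence.
    `heads[i]` is the 0-based position of word i's syntactic head,
    or None if word i is the root. Illustrative sketch only."""
    return sum(abs(h - i) for i, h in enumerate(heads) if h is not None)
```

For a three-word sentence whose first and third words both depend on the second, the total dependency length is 1 + 1 = 2; longer totals indicate heads and dependents separated by more intervening material.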
A Systematic Exploration of Linguistic Phenomena in Spoken Hindi Resource Creation and Hypothesis Testing
Aadya Ranjan, Sidharth Ranjan, Rajakrishnan P Rajkumar
International Conference on Natural Language Processing, ICON, 2024
@inproceedings{bib_A_Sy_2024, AUTHOR = {Aadya Ranjan, Sidharth Ranjan, Rajakrishnan P Rajkumar}, TITLE = {A Systematic Exploration of Linguistic Phenomena in Spoken Hindi Resource Creation and Hypothesis Testing}, BOOKTITLE = {International Conference on Natural Language Processing}, YEAR = {2024}}
This paper presents a meticulous and well-structured approach to annotating a corpus of Hindi spoken data. We deployed four annotators to augment the spoken section of the EMILLE Hindi corpus by marking the various linguistic phenomena observed in spoken data. We then analyzed various phonological (sound deletion), morphological (code-mixing and reduplication) and syntactic phenomena (case markers and ambiguity) not attested in written data. Code mixing and switching constitute the majority of the phenomena we annotated, followed by orthographic errors related to symbols in the Devanagari script. In terms of divergences from the written form of Hindi, case marker usage, missing auxiliary verbs and agreement patterns are markedly distinct in spoken Hindi. The annotators also assigned a quality rating to each sentence in the corpus. Our analysis of the quality ratings revealed that most sentences in the spoken data corpus are of moderate to high quality. Female speakers produced a greater percentage of high-quality sentences than their male counterparts. While previous efforts in corpus annotation have largely focused on creating resources for engineering applications, we illustrate the utility of our dataset for scientific hypothesis testing. Inspired by the Surprisal Theory of language comprehension (Hale, 2001; Levy, 2008), we validate the hypothesis that sentences with high lexical surprisal values are rated low in quality by native speakers, even when controlling for sentence length and word frequencies.
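As a toy illustration of what "controlling for" a covariate means here (not the paper's actual statistical model, which would typically be a regression with multiple controls), one can residualize both surprisal and quality on sentence length and correlate the residuals; all function names are illustrative:

```python
def simple_ols(x, y):
    """Slope and intercept of y ~ x by ordinary least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, my - slope * mx

def partial_corr(surprisal, quality, length):
    """Correlation of surprisal and quality after residualizing both
    on sentence length: a crude stand-in for regression-based control.
    Illustrative sketch only."""
    def residuals(y):
        b, a = simple_ols(length, y)
        return [yi - (a + b * li) for yi, li in zip(y, length)]
    rs, rq = residuals(surprisal), residuals(quality)
    n = len(rs)
    ms, mq = sum(rs) / n, sum(rq) / n
    num = sum((s - ms) * (q - mq) for s, q in zip(rs, rq))
    den = (sum((s - ms) ** 2 for s in rs)
           * sum((q - mq) ** 2 for q in rq)) ** 0.5
    return num / den
```

Under the paper's hypothesis, the residual correlation between surprisal and quality rating would be reliably negative even after length (and, in the full model, word frequency) is partialled out.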