Post-Net2.0: An adaptive weighted loss function driven by linguistic constraint for automatic syllable stress detection
@inproceedings{bib_Post_2025, AUTHOR = {Aluru Sai Harshitha, Mallela Jhansi, Chiranjeevi Yarra}, TITLE = {Post-Net2.0: An adaptive weighted loss function driven by linguistic constraint for automatic syllable stress detection}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}. YEAR = {2025}}
Automatic syllable stress detection is an essential component in computer-assisted language learning (CALL) systems to guide non-native language learners. In English, each word typically contains only one primary stressed syllable. However, standard loss functions, such as Binary Cross-Entropy (BCE), often result in predictions where multiple syllables may be stressed or none at all. As a result, automatic syllable stress detection models frequently require an additional post-processing step to ensure that only one syllable is stressed per word.
This reliance on post-processing suggests that the model does not fully capture the stress patterns. To address this issue, we propose an adaptive weighted loss function that builds upon the Stress Intensity Modulation Loss proposed in our recent work on Post-Net. This adaptive weighted loss function is designed to enforce the constraint of a single primary stressed syllable directly during model training. We integrate this loss function into the previously proposed Post-Net (PN_DNN) and into a new architecture that is a hybrid of Post-Net and LSTM (PN_DLSTM). Their performance is compared against state-of-the-art models trained with the standard BCE loss. Experiments conducted on the ISLE corpus reveal that both models trained only with the BCE loss show a significant accuracy gap between results with and without post-processing. In contrast, when these models are trained with the proposed adaptive weighted loss function, the gap is narrowed in both models.
Between the two models, the largest reduction is observed for PN_DNN, with the gap decreasing from 3.87% to 2.45% and from 4.6% to 3.75% for GER and ITA, respectively. This indicates that the adaptive weighted loss function effectively captures the linguistic constraint during training, reducing the need for post-processing.
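As a rough illustration of the single-stress constraint idea, the sketch below folds a soft per-word penalty into a standard BCE loss in PyTorch. The weighting here is a fixed scalar rather than the adaptive scheme described in the paper, and the tensor layout and helper name are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def single_stress_weighted_loss(logits, labels, syllable_mask, alpha=1.0):
    """Illustrative word-level loss combining BCE with a soft single-stress penalty.

    logits, labels, syllable_mask: (batch, max_syllables) tensors; the mask is 1
    for real syllables and 0 for padding. The penalty term pushes the summed
    stress probability of each word toward exactly one stressed syllable.
    """
    probs = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    bce = (bce * syllable_mask).sum(dim=1) / syllable_mask.sum(dim=1)

    # Soft linguistic constraint: per word, the predicted stress mass should be ~1.
    stress_mass = (probs * syllable_mask).sum(dim=1)
    constraint = (stress_mass - 1.0).abs()

    return (bce + alpha * constraint).mean()


# Toy usage: batch of 2 words, up to 3 syllables each.
logits = torch.randn(2, 3)
labels = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]])
print(single_stress_weighted_loss(logits, labels, mask))
```

In this toy setup the penalty term discourages predictions with zero or multiple stressed syllables per word, which is the linguistic constraint the paper enforces during training.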
Evaluating the Impact of Discriminative and Generative E2E Speech Enhancement Models on Syllable Stress Preservation
Rangavajjala Sankara Bharadwaj,Mallela Jhansi,Aluru Sai Harshitha,Chiranjeevi Yarra
@inproceedings{bib_Eval_2025, AUTHOR = {Rangavajjala Sankara Bharadwaj, Mallela Jhansi, Aluru Sai Harshitha, Chiranjeevi Yarra}, TITLE = {Evaluating the Impact of Discriminative and Generative E2E Speech Enhancement Models on Syllable Stress Preservation}, BOOKTITLE = {International Conference on Acoustics, Speech, and Signal Processing}. YEAR = {2025}}
Automatic syllable stress detection is a crucial component in Computer-Assisted Language Learning (CALL) systems for language learners. Current stress detection models are typically trained on clean speech, which may not be robust in real-world scenarios where background noise is prevalent. To address this, speech enhancement (SE) models, designed to enhance speech by removing noise, might be employed, but their impact on preserving syllable stress patterns is not well studied. This study examines how different SE models, representing discriminative and generative modeling approaches, affect syllable stress detection under noisy conditions. We assess these models by applying them to speech data with signal-to-noise ratios (SNRs) varying from 0 to 20 dB, and evaluating their effectiveness in maintaining stress patterns. Additionally, we explore different feature sets to determine which ones are most effective for capturing stress patterns amidst noise. To further understand the impact of SE models, a human-based perceptual study is conducted to compare the perceived stress patterns in SE-enhanced speech with those in clean speech, providing insights into how well these models preserve syllable stress as perceived by listeners. Experiments are performed on English speech data from non-native speakers with German and Italian as their native languages. The results reveal that the stress detection performance is robust with the generative SE models when heuristic features are used. The observations from the perceptual study are also consistent with the stress detection outcomes under all SE models.
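For readers reproducing the noisy-speech setup, the snippet below shows one common way to mix a noise recording into clean speech at a target SNR; the scaling convention and the synthetic signals are illustrative stand-ins, not taken from the paper.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`,
    then add it to the clean signal (both 1-D float arrays, same sample rate)."""
    # Tile or trim the noise to the clean-signal length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return clean + noise

# Example: degrade a synthetic "clean" signal at SNRs from 0 to 20 dB.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = {snr: mix_at_snr(clean, noise, snr) for snr in (0, 5, 10, 15, 20)}
```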
A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings
Anindita Mondal,Rangavajjala Sankara Bharadwaj,Mallela Jhansi,Anil Kumar Vuppala,Chiranjeevi Yarra
@inproceedings{bib_A_Pr_2025, AUTHOR = {Anindita Mondal, Rangavajjala Sankara Bharadwaj, Mallela Jhansi, Anil Kumar Vuppala, Chiranjeevi Yarra}, TITLE = {A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings}, BOOKTITLE = {Technical Report}. YEAR = {2025}}
Automatic detection of prominence at the word and syllable levels is critical for building computer-assisted language learning systems. It has been shown that prosody embeddings learned by current state-of-the-art (SOTA) text-to-speech (TTS) systems can render word- and syllable-level prominence in synthesized speech as naturally as in native speech. To understand the effectiveness of TTS prosody embeddings for prominence detection in a non-native context, a comparative analysis is conducted on the embeddings extracted from native and non-native speech considering the prominence-related embeddings, duration, energy, and pitch, from a SOTA TTS named FastSpeech2. These embeddings are extracted under two conditions considering: 1) only text, 2) both speech and text. For the first condition, the embeddings are extracted directly from the TTS inference mode, whereas for the second condition, we propose to extract them from the TTS under training mode. Experiments are conducted on a native speech corpus, Tatoeba, and a non-native speech corpus, ISLE. For experimentation, word-level prominence locations are manually annotated for both corpora. The highest relative improvements in word- and syllable-level prominence detection accuracies with the TTS embeddings are found to be 13.7% & 5.9% and 16.2% & 6.9% compared to those with the heuristic-based features and self-supervised Wav2Vec-2.0 representations, respectively.
IIITH Ucchar e-Sudharak: An automatic English pronunciation corrector for school-going children with a teacher in the loop
@inproceedings{bib_IIIT_2024, AUTHOR = {Chiranjeevi Yarra}, TITLE = {IIITH Ucchar e-Sudharak: An automatic English pronunciation corrector for school-going children with a teacher in the loop}, BOOKTITLE = {Proceedings of Interspeech}. YEAR = {2024}}
Language acquisition skills are predominant during childhood. Due to a lack of resources, a gap often arises in the language communication abilities of students. To bridge this gap, it is necessary to consider automatic methods for improving children's language skills. Our IIITH Ucchar e-Sudharak web-based tool, deployed in a real environment, helps school-going children practice English reading skills with a teacher in the loop. The choice of English as the target language is strategic, given its status as a global lingua franca. Within the tool, children are provided with class- and chapter-specific English sentences for practice, tailored by their teachers. During practice sessions, the tool provides real-time feedback in the form of sentence- and word-level scores, which are computed by comparing against teacher (expert) audio recordings. These scores offer insights into word-level pronunciation correctness and sentence-level quality by measuring fluency. Moreover, the tool provides a key feature where teachers can assign specific sentences to individual students and assess their performance. This functionality empowers teachers to personalize their instruction, catering to the specific needs of each student. As far as we know, this is a one-of-a-kind tool specifically designed to meet the needs of school-going children.
Language-Agnostic Analysis of Speech Depression Detection
Sona Binu,Jismi Jose,Fathima Shimna KV,Alino Luke Hans,Reni K Cherian,Starlet Ben Alex,Priyanka Srivastava,Chiranjeevi Yarra
@inproceedings{bib_Lang_2024, AUTHOR = {Sona Binu, Jismi Jose, Fathima Shimna KV, Alino Luke Hans, Reni K Cherian, Starlet Ben Alex, Priyanka Srivastava, Chiranjeevi Yarra}, TITLE = {Language-Agnostic Analysis of Speech Depression Detection}, BOOKTITLE = {India Council International Conference}. YEAR = {2024}}
People with Major Depressive Disorder (MDD) exhibit tonal variations in their speech compared to their healthy counterparts. However, these tonal variations are not confined to the state of MDD alone but also depend on the language, which has its own unique tonal patterns. This work analyzes automatic speech-based depression detection across two languages, English and Malayalam, which exhibit distinctive prosodic and phonemic characteristics. We propose an approach that utilizes speech data, collected along with self-reported labels, from participants reading sentences from the IViE corpus in both English and Malayalam. The IViE corpus consists of five sets of sentences: simple sentences, WH-questions, questions without morphosyntactic markers, inversion questions, and coordinations, which can naturally prompt speakers to speak in different tonal patterns. Convolutional Neural Networks (CNNs) are employed for detecting depression from speech. The CNN model is trained to identify acoustic features associated with depression in speech, focusing on both languages. The model's performance is evaluated on the collected dataset containing recordings from both depressed and non-depressed speakers, analyzing its effectiveness in detecting depression across the two languages. Our findings and collected data could contribute to the development of language-agnostic speech-based depression detection systems, thereby enhancing accessibility for diverse populations.
IIITSaint-EmoMDB: Carefully Curated Malayalam Speech Corpus with Emotion and Self-Reported Depression Ratings
Christa Thomas,Guneesh Vats,Aravind Johnson,Ashin George,Talit Sara George,Reni K Cherian,Priyanka Srivastava,Chiranjeevi Yarra
@inproceedings{bib_IIIT_2024, AUTHOR = {Christa Thomas, Guneesh Vats, Aravind Johnson, Ashin George, Talit Sara George, Reni K Cherian, Priyanka Srivastava, Chiranjeevi Yarra}, TITLE = {IIITSaint-EmoMDB: Carefully Curated Malayalam Speech Corpus with Emotion and Self-Reported Depression Ratings}, BOOKTITLE = {Conference of the Oriental COCOSDA}. YEAR = {2024}}
Mental health conditions such as depression and anxiety are pervasive issues that affect millions of individuals worldwide. Understanding the intricate relationship between emotional perception and mental health is crucial. This work develops a database named IIITSaint-EmoMDB containing Malayalam language speech samples annotated with emotion ratings on valence and arousal scales by three annotators, considering four emotion labels: happy, sad, angry and neutral. In addition to these emotion ratings, emotional perception is rated by 150 participants according to the circumplex model of valence and arousal. The database also contains self-reported mental health scores of those participants based on the PHQ-9 and GAD-7 questionnaires. Our preliminary analysis, comparing the participants' emotional perception with their PHQ-9 and GAD-7 scores, revealed that the depressed category showed the highest improvement.
InStant-EMDB: A Multi Model Spontaneous English and Malayalam Speech Corpora for Depression Detection
Anjali Mathew,Harsha Sanjan,Reni K Cherian,Starlet Ben Alex,Priyanka Srivastava,Chiranjeevi Yarra
@inproceedings{bib_InSt_2024, AUTHOR = {Anjali Mathew, Harsha Sanjan, Reni K Cherian, Starlet Ben Alex, Priyanka Srivastava, Chiranjeevi Yarra}, TITLE = {InStant-EMDB: A Multi Model Spontaneous English and Malayalam Speech Corpora for Depression Detection}, BOOKTITLE = {Conference of the Oriental COCOSDA}. YEAR = {2024}}
Depression, a pervasive mental health issue, highlights the critical need for early detection and effective intervention. A database (InStant-EMDB) is developed to analyze variations in the spoken responses of individuals with depression compared to healthy counterparts. Spoken responses are obtained from English and Malayalam bilingual speakers who respond spontaneously to a set of 15 emotionally evocative words. These words are sourced from the Affective Norms for English Words (ANEW) dataset, which includes valence and arousal ratings for each word. The speech in both English and Malayalam is manually transcribed to accurately reflect the spoken content. The dataset also contains self-reported affective ratings and data from a mental health survey (PHQ-9) collected from the participants to determine their mental state, which we consider as the self-reported depression labels. A preliminary analysis is conducted on the collected speech using current state-of-the-art deep learning models such as Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM) and Bi-directional Long Short-Term Memory networks (Bi-LSTM). Although distinct linguistic patterns exhibited by individuals struggling with depression are successfully identified by all three models in both Malayalam and English spoken responses, the highest accuracy is achieved by the LSTM models. Our findings and dataset emphasize the potential of linguistic patterns as valuable cues for the early identification and intervention of depression and could contribute to enhancing accessibility for diverse populations.
IIIT-Speech Twins 1.0: An English-Hindi Parallel Speech Corpora for Speech-to-Speech Machine Translation and Automatic Dubbing
@inproceedings{bib_IIIT_2024, AUTHOR = {Anindita Mondal, Anil Kumar Vuppala, Chiranjeevi Yarra}, TITLE = {IIIT-Speech Twins 1.0: An English-Hindi Parallel Speech Corpora for Speech-to-Speech Machine Translation and Automatic Dubbing}, BOOKTITLE = {Conference of the Oriental COCOSDA}. YEAR = {2024}}
The demand for high-quality parallel speech data has been increasing as deep learning-based Speech to Speech Machine Translation (SSMT) and automatic dubbing approaches gain popularity in speech applications. Traditional and well-established speech applications such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) heavily rely on large corpora of monolingual speech and the corresponding text. While there is a wealth of parallel text data available for both English and Indic languages, parallel speech data is available only for English and other European languages, yet it often lacks natural prosody and semantic alignment between the languages. For achieving cross-lingual prosody transfer, end-to-end SSMT models, and high-quality dubbing from English to Hindi, in this work, an English-Hindi parallel bilingual speech-text corpus named IIIT-Speech Twins 1.0 is created. This data contains twin-like English and Hindi speech-text pairs obtained from publicly available children's stories in both languages, through manual and automatic processing. Starting with 8 stories in each language, totaling around 4 hours of audio, the final outcome was a 2-hour dataset. This was achieved through systematic segmentation, removal of non-speech background audio, and sentence-by-sentence alignment to ensure accurate meaning in both languages. In addition to ensuring proper alignment and transcription, this dataset offers a rich source of natural prosody, expressions, and emotions, owing to the narrative diversity within the stories. The dataset also provides significant speaker variability, with different characters being voiced by various speakers, enhancing the richness of the IIIT-Speech Twins 1.0 corpus.
Post-Net: A linguistically inspired sequence-dependent transformed neural architecture for automatic syllable stress detection
@inproceedings{bib_Post_2024, AUTHOR = {Sai Harshitha, Mallela Jhansi, Chiranjeevi Yarra}, TITLE = {Post-Net: A linguistically inspired sequence-dependent transformed neural architecture for automatic syllable stress detection}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2024}}
Automatic syllable stress detection methods typically consider syllable-level features as independent. However, as per linguistic studies, there is a dependency among the syllables within a word. In this work, we address this issue by proposing a Post-Net approach using Time-Delay Neural Networks to exploit the syllable dependency within a word for the stress detection task. For this, we propose a loss function that incorporates the dependency by ensuring only one stressed syllable per word. The proposed Post-Net leverages the existing SOTA sequence-independent stress detection models and learns in both supervised and unsupervised settings. We compare the Post-Net with three existing SOTA sequence-independent models and also with sequential models (LSTMs). Experiments conducted on the ISLE corpus show the highest relative accuracy improvements of 2.1% and 20.28% with the proposed Post-Net compared to the best sequence-independent SOTA model in the supervised and unsupervised settings, respectively.
A comparative analysis of sequential models that integrate syllable dependency for automatic syllable stress detection
@inproceedings{bib_A_co_2024, AUTHOR = {Mallela Jhansi, Sai Harshitha, Chiranjeevi Yarra}, TITLE = {A comparative analysis of sequential models that integrate syllable dependency for automatic syllable stress detection}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2024}}
Automatic syllable stress detection typically operates at the syllable level with stress-related acoustic features. The stress placed on a syllable is influenced not only by its own characteristics but also by its context in the word. However, traditional methods for stress detection overlook the contextual acoustic factors that influence stress placement. To address this issue, we study sequential modeling approaches that integrate syllable dependency for automatic syllable stress detection using a masking strategy. This approach considers a sequence of syllables at the word level and identifies its stress label sequence. We explore various sequential models, such as RNNs, LSTMs, GRUs, and Attention networks. We conduct experiments on the ISLE corpus comprising English spoken by non-native speakers. From the experiments, we observe a significant improvement in performance with all sequential models compared to the state-of-the-art non-sequential baseline (DNN).
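As a minimal sketch of sequence-level modeling with masking, the example below tags each syllable of a word with a stress logit using a BiLSTM and ignores padded positions in the loss; the feature dimension, layer sizes, and masking convention are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class WordLevelStressTagger(nn.Module):
    """Illustrative BiLSTM tagger over per-syllable feature vectors of one word;
    padding positions are ignored via a mask at loss time."""

    def __init__(self, feat_dim=16, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)   # one stress logit per syllable

    def forward(self, x):                     # x: (batch, max_syllables, feat_dim)
        h, _ = self.lstm(x)
        return self.out(h).squeeze(-1)        # (batch, max_syllables)

def masked_bce(logits, labels, mask):
    # Per-syllable BCE, averaged over real (unmasked) syllables only.
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    return (loss * mask).sum() / mask.sum()

# Toy usage: 2 words, up to 4 syllables, 16-dim prosodic features per syllable.
x = torch.randn(2, 4, 16)
labels = torch.tensor([[1., 0., 0., 0.], [0., 1., 0., 0.]])
mask = torch.tensor([[1., 1., 1., 0.], [1., 1., 0., 0.]])
model = WordLevelStressTagger()
print(masked_bce(model(x), labels, mask))
```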
Exploring the Use of Self-Supervised Representations for Automatic Syllable Stress Detection
Mallela Jhansi,Sai Harshitha,Chiranjeevi Yarra
National Conference on Communications, NCC, 2024
@inproceedings{bib_Expl_2024, AUTHOR = {Mallela Jhansi, Sai Harshitha, Chiranjeevi Yarra}, TITLE = {Exploring the Use of Self-Supervised Representations for Automatic Syllable Stress Detection}, BOOKTITLE = {National Conference on Communications}. YEAR = {2024}}
The task of automatically detecting syllable stress is a key module in computer-assisted language learning systems. There are numerous studies proposed in the literature for automatic syllable stress detection using different knowledge-based prosodic features. Also, different statistical machine learning and deep learning models have been explored for this task using knowledge-based features. However, the acoustic parameters considered to compute knowledge-based features might not always represent the stress phenomena; hence, knowledge-based features are not always suitable for generalization and scalability. Recently, the rapidly emerging self-supervised learning based representations are outperforming existing state-of-the-art knowledge-based features in many speech applications. Also, these representations allow the models to be built in an end-to-end fashion. In this work, we explore the use of self-supervised representations (Wav2Vec-2.0) for syllable stress detection and compare the performance with state-of-the-art knowledge-based features. Further, we use our recently proposed explicit representation learning framework, modeled by jointly optimizing a variational autoencoder (VAE) and a DNN, for stress detection. We analyze the performance of the representation learning framework with two different state-of-the-art classifiers: support vector machines (SVM) and a simple deep neural network (DNN). We conduct experiments on two non-native English speakers' datasets from the ISLE corpus, i.e., German (GER) and Italian (ITA). From the analysis, it is observed that the classification accuracy for syllable stress detection using self-supervised representations significantly improves, by 3.2% and 2.7% over knowledge-based features in GER and ITA, respectively. From the t-SNE plots, it is observed that the representations learned by the explicit representation learning framework with the VAE show better discrimination between stressed and unstressed syllables compared with representations learned implicitly with a simple DNN.
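The snippet below sketches one way to pull frame-level Wav2Vec-2.0 representations with torchaudio and average them over a hypothetical syllable span to obtain a syllable-level vector; the pooling choice and the boundary times are illustrative, and the paper's feature pipeline may differ.

```python
import torch
import torchaudio

# Assumes a 16 kHz mono waveform; downloads pretrained weights on first use.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform = torch.randn(1, bundle.sample_rate)          # stand-in for 1 s of audio
with torch.inference_mode():
    features, _ = model.extract_features(waveform)     # list of per-layer tensors
frames = features[-1][0]                               # (num_frames, 768), last layer

# Pool frames falling inside a syllable boundary (here 0.10 s to 0.35 s,
# hypothetical) into a single syllable-level representation.
sec_per_frame = waveform.shape[1] / frames.shape[0] / bundle.sample_rate
start, end = int(0.10 / sec_per_frame), int(0.35 / sec_per_frame)
syllable_vec = frames[start:end].mean(dim=0)
print(syllable_vec.shape)                              # torch.Size([768])
```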
An Investigation of Indian Native Language Phonemic Influences on L2 English Pronunciations
Shelly Jain,Priyanshi Pal,Anil Kumar Vuppala,Prasanta Kumar Ghosh,Chiranjeevi Yarra
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2023
@inproceedings{bib_An_I_2023, AUTHOR = {Shelly Jain, Priyanshi Pal, Anil Kumar Vuppala, Prasanta Kumar Ghosh, Chiranjeevi Yarra}, TITLE = {An Investigation of Indian Native Language Phonemic Influences on L2 English Pronunciations}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2023}}
Speech systems are sensitive to accent variations. This is a challenge in India, which has numerous languages but few linguistic studies on pronunciation variation. The growing number of L2 English speakers reinforces the need to study accents and L1-L2 interactions. We investigate Indian English (IE) accents and report our observations on regional and shared features. Specifically, we observe phonemic variations and phonotactics in speakers' native languages and apply this to their English pronunciations. We demonstrate the influence of 18 Indian languages on IE by comparing native language features with IE pronunciations obtained from literature studies and phonetically annotated speech. Hence, we validate Indian language influences on IE by justifying pronunciation rules from the perspective of Indian language phonology. We obtain a comprehensive description of generalised and region-specific IE characteristics, which facilitates accent adaptation of existing speech systems.
A Comparison of Learned Representations with Jointly Optimized VAE and DNN for Syllable Stress Detection
Mallela Jhansi,Boyina Prasanth Sai,Chiranjeevi Yarra
International Conference on Speech and Computers, SPECOM, 2023
@inproceedings{bib_A_Co_2023, AUTHOR = {Mallela Jhansi, Boyina Prasanth Sai, Chiranjeevi Yarra}, TITLE = {A Comparison of Learned Representations with Jointly Optimized VAE and DNN for Syllable Stress Detection}, BOOKTITLE = {International Conference on Speech and Computers}. YEAR = {2023}}
Automatic syllable stress detection is helpful in assessing L2 learners' pronunciation. In this work, for stress detection, we propose a representation learning framework by jointly optimizing a VAE and a DNN. The representations obtained from the proposed VAE plus DNN framework are compared with the implicit representations learned from DNN-based stress detection. Further, we compare the representations obtained from VAE plus DNN with those obtained from autoencoder (AE) plus DNN and sparse-autoencoder (SAE) plus DNN, considering with/without implicit representations from the DNN. We perform the experiments on the ISLE corpus consisting of English utterances from German and Italian native speakers. We observe that the detection performance with the learned representations from VAE plus DNN is significantly better than that with the state-of-the-art method without any representation learning, with the highest improvements of 2.2%, 5.1%, and 1.4% under matched, combined, and cross scenarios, respectively.
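A minimal sketch of joint VAE-plus-DNN optimization is given below: the latent code feeds both a reconstruction decoder and a stress classifier, and the two losses are summed with illustrative weights. Layer sizes, loss weighting, and input dimensionality are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEClassifier(nn.Module):
    """Illustrative joint VAE + classifier: the encoder's latent z is used both
    for reconstruction and for stress classification, and the two objectives are
    optimized together."""

    def __init__(self, in_dim=20, latent=8):
        super().__init__()
        self.enc = nn.Linear(in_dim, 32)
        self.mu, self.logvar = nn.Linear(32, latent), nn.Linear(32, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(), nn.Linear(32, in_dim))
        self.clf = nn.Sequential(nn.Linear(latent, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        return self.dec(z), self.clf(z).squeeze(-1), mu, logvar

def joint_loss(x, y, model, beta=0.1, gamma=1.0):
    # Reconstruction + KL (VAE part) plus BCE on the stress label (DNN part).
    recon, logit, mu, logvar = model(x)
    rec = F.mse_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    clf = F.binary_cross_entropy_with_logits(logit, y)
    return rec + beta * kl + gamma * clf

x, y = torch.randn(4, 20), torch.tensor([1., 0., 0., 1.])
print(joint_loss(x, y, VAEClassifier()))
```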
Study of Indian English Pronunciation Variabilities Relative to Received Pronunciation
Priyanshi Pal,Shelly Jain,Chiranjeevi Yarra,Prasanta Kumar Ghosh,Anil Kumar Vuppala
International Conference on Speech and Computers, SPECOM, 2023
@inproceedings{bib_Stud_2023, AUTHOR = {Priyanshi Pal, Shelly Jain, Chiranjeevi Yarra, Prasanta Kumar Ghosh, Anil Kumar Vuppala}, TITLE = {Study of Indian English Pronunciation Variabilities Relative to Received Pronunciation}, BOOKTITLE = {International Conference on Speech and Computers}. YEAR = {2023}}
Analysis of Indian English (IE) pronunciation variabilities is useful in ASR and TTS modelling for the Indian context. Prior works characterised IE variabilities by reporting qualitative phonetic rules relative to Received Pronunciation (RP). However, such characterisations lack quantitative descriptors and data-driven analysis of diverse IE pronunciations, which could be due to the scarcity of phonetically labelled data. Furthermore, the versatility of IE stems from the influence of a large diversity of the speakers' mother tongues and demographic region differences. To address these issues, we consider the Indic TIMIT corpus and manually obtain 13,632 phonetic transcriptions in addition to those available as part of the corpus. By performing a data-driven analysis on 15,974 phonetic transcriptions of 80 speakers from diverse regions of India, we present a new set of phonetic rules and validate them against the existing phonetic rules to identify their relevance. Finally, we test the efficacy of Grapheme-to-Phoneme (G2P) conversion developed based on the obtained rules, considering Phoneme Error Rate (PER) as the performance metric.
Analysis of Natural Language Understanding Systems with L2 Learner Specific Synthetic Grammatical Errors Based on Parts-of-Speech
Snehal Ranjan,Nanduri Venkata Sai Krishna Kalyan,Prakul Virdi,Chiranjeevi Yarra
International Conference on Speech and Computers, SPECOM, 2023
@inproceedings{bib_Anal_2023, AUTHOR = {Snehal Ranjan, Nanduri Venkata Sai Krishna Kalyan, Prakul Virdi, Chiranjeevi Yarra}, TITLE = {Analysis of Natural Language Understanding Systems with L2 Learner Specific Synthetic Grammatical Errors Based on Parts-of-Speech}, BOOKTITLE = {International Conference on Speech and Computers}. YEAR = {2023}}
Second language learners often make grammatical mistakes, which can impact the performance of Spoken Language Understanding (SLU) systems. SLU systems consist of two key components: Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU). This work focuses on the effects, on the NLU module, of grammatical errors synthetically generated by manipulating Parts-of-Speech (POS) tokens. Our analysis comprises three main aspects. Firstly, we assess the impact of grammatical errors on the overall performance of NLU systems, specifically in the domains of Intent Detection and Slot Filling. Secondly, we investigate NLU performance with respect to POS tags. Lastly, we utilize an Attention-based NLU model to evaluate the significance of different POS categories. This study evaluates NLU models on the ATIS and SNIPS datasets for Intent Detection and Slot Filling tasks with introduced grammatical errors, leading to a significant performance drop of up to 4.7% for Intent Detection and 5.1% for Slot Filling. However, training on datasets with synthetic errors helps mitigate this drop, and the models are shown to be sensitive to increasing levels of corrupted tokens. Additionally, attention analysis reveals that Proper Nouns carry higher weights in determining intent, while corruption of specific POS categories, particularly Auxiliary Verbs and Nouns, is more likely to cause intent misclassification.
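The snippet below sketches one way such POS-driven corruption could be synthesized with spaCy, here by dropping auxiliary verbs; the corruption scheme, probability, and target POS are illustrative choices, not the paper's exact generation procedure. It assumes the small English model is installed (python -m spacy download en_core_web_sm).

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed to be installed locally

def corrupt_by_pos(sentence, target_pos="AUX", drop_prob=1.0, seed=0):
    """Synthesise a learner-style grammatical error by dropping tokens of one
    POS category (auxiliaries by default); purely an illustrative scheme."""
    random.seed(seed)
    doc = nlp(sentence)
    kept = [t.text for t in doc if not (t.pos_ == target_pos and random.random() < drop_prob)]
    return " ".join(kept)

print(corrupt_by_pos("She is going to the airport tomorrow."))
# Expected output (auxiliary removed, tokens re-joined with spaces):
# "She going to the airport tomorrow ."
```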
Unsupervised pronunciation assessment analysis using utterance level alignment distance with self-supervised representations
Nayan Anand,Sirigiraju Meenakshi,Chiranjeevi Yarra
India Council International Conference, INDICON, 2023
@inproceedings{bib_Unsu_2023, AUTHOR = {Nayan Anand, Sirigiraju Meenakshi, Chiranjeevi Yarra}, TITLE = {Unsupervised pronunciation assessment analysis using utterance level alignment distance with self-supervised representations}, BOOKTITLE = {India Council International Conference}. YEAR = {2023}}
The pronunciation quality of second language (L2) learners can be affected by different factors, including the following seven: intelligibility, intonation, phoneme quality, mispronunciation, mother tongue influence, correct placement of pauses, and correct stress placement. An automatic assessment of these seven factors could be helpful for developing computer-assisted language learning systems. In this work, we assess the quality of all seven factors with an unsupervised approach using DTW-based utterance-level alignment distance between an expert's and a learner's speech. Unlike existing works that consider factor-specific heuristic-based features, we explore Wav2Vec-2.0 based self-supervised representations as features for assessing all seven factors. The distance is computed using the following three distance metrics: mean absolute error (MAE), mean squared error (MSE) and cosine distance (CD). Experiments are conducted on the voisTUTOR corpus containing spoken English speech samples from 16 Indian L2 learners annotated with binary ratings (1 and 0) for all the factors. Using each distance metric, four distance values are computed for each learner's speech with a set of four expert samples. In the assessment, to circumvent the need for parallel expert data, we consider two (out of four) expert samples synthesized with state-of-the-art text-to-speech (TTS) systems. We observe that the performance of the considered distance metric based unsupervised assessment approach is significantly better than that of the baseline for six out of seven factors under all three distance metrics and all four expert samples.
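A compact sketch of the alignment-distance idea is shown below: classic DTW between two representation sequences with a pluggable frame-level metric (MAE, MSE, or cosine distance), returning a length-normalised cost. The normalisation and the random stand-in features are assumptions made for the example.

```python
import numpy as np

def dtw_distance(a, b, metric="cosine"):
    """Alignment distance between two feature sequences a (n, d) and b (m, d)
    via classic DTW; the frame-level metric is MAE, MSE, or cosine distance."""
    def frame_dist(x, y):
        if metric == "mae":
            return np.mean(np.abs(x - y))
        if metric == "mse":
            return np.mean((x - y) ** 2)
        return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)

    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = frame_dist(a[i - 1], b[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1]
            )
    return D[n, m] / (n + m)        # length-normalised alignment cost

# Toy usage: expert vs. learner representation sequences (random stand-ins).
expert, learner = np.random.randn(50, 768), np.random.randn(60, 768)
for metric in ("mae", "mse", "cosine"):
    print(metric, dtw_distance(expert, learner, metric))
```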
IIITH MM2 Speech-Text: A preliminary data for automatic spoken data validation with matched and mismatched speech-text content
Nayan Anand,Sirigiraju Meenakshi,Chiranjeevi Yarra
Conference of the Oriental COCOSDA, O-COCOSDA, 2023
@inproceedings{bib_IIIT_2023, AUTHOR = {Nayan Anand, Sirigiraju Meenakshi, Chiranjeevi Yarra}, TITLE = {IIITH MM2 Speech-Text: A preliminary data for automatic spoken data validation with matched and mismatched speech-text content}, BOOKTITLE = {Conference of the Oriental COCOSDA}. YEAR = {2023}}
The demand for high-quality speech data has been increasing as deep-learning approaches gain popularity in speech applications. Among these, automatic speech recognition (ASR) and text-to-speech (TTS) require large amounts of data containing speech and the corresponding text. For these applications, high-quality data is often obtained through manual validation, which ensures matching between speech and text. Manual validation is not scalable to this demand due to the cost and time involved. In order to cater to the high-quality data demand, validating the data automatically could be useful. In this work, for automatic data validation, a spoken English corpus named IIITH MM2 Speech-Text is created, containing matched and mismatched speech-text pairs under read speech conditions from Indian speakers with different nativities. For the creation, we consider 100 unique stimuli selected from the TIMIT corpus, ensuring phonetic richness, for which a joint entropy maximization is proposed. These stimuli are recorded from 50 speakers, resulting in matched and mismatched sets containing 5000 and 764 utterances with total durations of 6 hours and 1 hour, respectively. The mismatched set contains speech from instances where the speakers naturally made spoken errors while reading the reference text. It also contains two stimuli per utterance: one stimulus is the reference text, and the other is manually annotated text that reflects the erroneous speech. Thus, the reference and the annotated text are used for building models for speech-text mismatch detection and correction, respectively. To the best of our knowledge, no such corpora exist containing both matched and mismatched speech-text. As a preliminary analysis for speech-text mismatch detection, a baseline considering Wav2Vec-2.0 representations and DTW results in a detection F1-score of 0.87.
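As an illustration of entropy-driven stimulus selection, the sketch below greedily picks sentences whose pooled phone distribution has maximal entropy at each step; this is a simple greedy surrogate, not the joint entropy maximization formulated in the paper, and the phone sequences are hypothetical.

```python
from collections import Counter
import math

def phone_entropy(counts):
    # Shannon entropy (bits) of a phone-count distribution.
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

def greedy_select(candidates, k):
    """Greedy approximation to entropy-maximising stimulus selection:
    `candidates` maps a sentence id to its phone list; pick k sentences whose
    pooled phone distribution has maximal entropy at each step."""
    chosen, pooled = [], Counter()
    remaining = dict(candidates)
    for _ in range(min(k, len(remaining))):
        best_id = max(remaining, key=lambda s: phone_entropy(pooled + Counter(remaining[s])))
        pooled += Counter(remaining.pop(best_id))
        chosen.append(best_id)
    return chosen

# Toy usage with hypothetical phone sequences.
cands = {"s1": ["AA", "B", "K"], "s2": ["AA", "AA", "AA"], "s3": ["IY", "T", "S", "D"]}
print(greedy_select(cands, 2))
```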
Can the decoded text from automatic speech recognition effectively detect spoken grammar errors
Chowdam Venkata Thirumala Kumar,Sirigiraju Meenakshi,Rakesh Vaideeswaran,Prasanta Kumar Ghosh,Chiranjeevi Yarra
Workshop on Speech and Language Technology in Education, SLaTE, 2023
@inproceedings{bib_Can__2023, AUTHOR = {Chowdam Venkata Thirumala Kumar, Sirigiraju Meenakshi, Rakesh Vaideeswaran, Prasanta Kumar Ghosh, Chiranjeevi Yarra}, TITLE = {Can the decoded text from automatic speech recognition effectively detect spoken grammar errors}, BOOKTITLE = {Workshop on Speech and Language Technology in Education}. YEAR = {2023}}
Language learning involves the correct acquisition of grammar skills. To facilitate learning with computer-assisted systems, automatic spoken grammatical error detection (SGED) is necessary. This work explores Automatic Speech Recognition (ASR), which decodes text from speech, for SGED. With current advancements in ASR technology, it is often believed that these systems can capture spoken grammatical errors in the decoded text. However, these systems have an inherent bias from the language model towards grammatically correct text. We explore the ASR-decoded text from commercially available current state-of-the-art systems considering a text-based GED algorithm and also its word-level confidence score (CS) for SGED. We perform the experiments on spoken English data collected in-house from 13 subjects speaking 4110 grammatically erroneous and correct sentences. We find that the highest relative improvement in SGED with CS is 15.36% compared to that with decoded text plus GED.
Unsupervised speech intelligibility assessment with utterance level alignment distance between teacher and learner Wav2Vec-2.0 representations
Nayan Anand,Sirigiraju Meenakshi,Chiranjeevi Yarra
Technical Report, arXiv, 2023
@inproceedings{bib_Unsu_2023, AUTHOR = {Nayan Anand, Sirigiraju Meenakshi, Chiranjeevi Yarra}, TITLE = {Unsupervised speech intelligibility assessment with utterance level alignment distance between teacher and learner Wav2Vec-2.0 representations}, BOOKTITLE = {Technical Report}. YEAR = {2023}}
Speech intelligibility is crucial in language learning for effective communication. Thus, to develop computer-assisted language learning systems, automatic speech intelligibility detection (SID) is necessary. Most of the existing works have assessed intelligibility in a supervised manner using manual annotations, which incurs cost and time; hence scalability is limited. To overcome these limitations, this work proposes an unsupervised approach for SID. The proposed approach considers the alignment distance computed with dynamic-time warping (DTW) between teacher and learner representation sequences as a measure to separate intelligible versus non-intelligible speech. We obtain the feature sequences using current state-of-the-art self-supervised representations from Wav2Vec-2.0. We found the detection accuracies to be 90.37%, 92.57% and 96.58%, respectively, with three alignment distance measures: mean absolute error, mean squared error and cosine distance (equal to one minus cosine similarity).
mulEEG: A Multi-view Representation Learning on EEG Signals
Vamsi Kumar,Likith Reddy,Shivam Kumar Sharma,Kamalakar Dadi ,Chiranjeevi Yarra
International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI, 2022
@inproceedings{bib_mulE_2022, AUTHOR = {Vamsi Kumar, Likith Reddy, Shivam Kumar Sharma, Kamalakar Dadi , Chiranjeevi Yarra}, TITLE = {mulEEG: A Multi-view Representation Learning on EEG Signals}, BOOKTITLE = {International Conference on Medical Image Computing and Computer Assisted Intervention}. YEAR = {2022}}
Modeling effective representations using multiple views that positively influence each other is challenging, and the existing methods perform poorly on Electroencephalogram (EEG) signals for sleep-staging tasks. In this paper, we propose a novel multi-view self-supervised method (mulEEG) for unsupervised EEG representation learning. Our method attempts to effectively utilize the complementary information available in multiple views to learn better representations. We introduce a diverse loss that further encourages complementary information across multiple views. Our method, with no access to labels, beats supervised training while outperforming multi-view baseline methods on transfer learning experiments carried out on sleep-staging tasks. We posit that our method was able to learn better representations by using complementary multi-views.
Voistutor 2.0: A Speech Corpus with Phonetic Transcription for Pronunciation Evaluation of Indian L2 English Learners
Priyanshi Pal,Chiranjeevi Yarra,Prasanta Kumar Ghosh
Conference of the Oriental COCOSDA, O-COCOSDA, 2022
@inproceedings{bib_Vois_2022, AUTHOR = {Priyanshi Pal, Chiranjeevi Yarra, Prasanta Kumar Ghosh}, TITLE = {Voistutor 2.0: A Speech Corpus with Phonetic Transcription for Pronunciation Evaluation of Indian L2 English Learners}, BOOKTITLE = {Conference of the Oriental COCOSDA}. YEAR = {2022}}
In computer assisted pronunciation training (CAPT), robust automatic models are critical for pronunciation assessment and mispronunciation detection and diagnosis (MDD). In the modelling, besides the audio data of second language (L2) learners, CAPT requires manually annotated ratings of overall pronunciation quality, and MDD uses manually annotated phonetic transcriptions. Though pronunciation quality and mispronunciation are interdependent, to the best of our knowledge, none of the existing corpora contains both ratings and phonetic transcriptions. This could be due to the cost involved in obtaining phonetic transcriptions. However, a corpus with both kinds of information could help researchers obtain robust models by exploring the interdependencies. To address this, we develop the voisTUTOR 2.0 corpus considering the existing voisTUTOR corpus, referred to as voisTUTOR 1.0. We obtain phonetic transcriptions manually from a linguist for the entire Indian L2 learners' English audio data (26529 utterances, approximately 14 hours) in voisTUTOR 1.0, for which overall quality ratings and binary scores of factors influencing pronunciation quality are available. A preliminary analysis of voisTUTOR 2.0 suggests that the phonetic errors correlate with the ratings and the binary scores.
An Investigation of Indian Native Language Phonemic Influences on L2 English Pronunciations
Shelly Jain,Priyanshi Pal,Anil Kumar Vuppala,Prasanta Ghosh,Chiranjeevi Yarra
Technical Report, arXiv, 2022
@inproceedings{bib_An_I_2022, AUTHOR = {Shelly Jain, Priyanshi Pal, Anil Kumar Vuppala, Prasanta Ghosh, Chiranjeevi Yarra}, TITLE = {An Investigation of Indian Native Language Phonemic Influences on L2 English Pronunciations}, BOOKTITLE = {Technical Report}. YEAR = {2022}}
Speech systems are sensitive to accent variations. This is especially challenging in the Indian context, with an abundance of languages but a dearth of linguistic studies characterising pronunciation variations. The growing number of L2 English speakers in India reinforces the need to study accents and L1-L2 interactions. We investigate the accents of Indian English (IE) speakers and report in detail our observations, both specific and common to all regions. In particular, we observe the phonemic variations and phonotactics occurring in the speakers' native languages and apply this to their English pronunciations. We demonstrate the influence of 18 Indian languages on IE by comparing the native language pronunciations with IE pronunciations obtained jointly from existing literature studies and phonetically annotated speech of 80 speakers. Consequently, we are able to validate the intuitions of Indian language influences on IE pronunciations by justifying pronunciation rules from the perspective of Indian language phonology. We obtain a comprehensive description in terms of universal and region-specific characteristics of IE, which facilitates accent conversion and adaptation of existing ASR and TTS systems to different Indian accents.
Automatic syllable stress detection under non-parallel label and data condition
Chiranjeevi Yarra, Prasanta Kumar Ghosh
Speech Communication, SpComm, 2022
@inproceedings{bib_Auto_2022, AUTHOR = {Chiranjeevi Yarra, Prasanta Kumar Ghosh}, TITLE = {Automatic syllable stress detection under non-parallel label and data condition}, BOOKTITLE = {Speech Communication}. YEAR = {2022}}
Typically, automatic syllable stress detection is posed as a supervised classification problem, for which a classifier is trained using manually annotated (existing) syllable data and stress labels. However, in real testing scenarios, syllable data is estimated, since manual annotation is not possible. Further, the estimation process could result in a mismatch between the lengths of the estimated and the existing syllable data, causing no one-to-one correspondence between the estimated syllable data and the existing labels. Hence, the existing labels and estimated syllable data together cannot be used to train the classifier. This can be avoided by manually labeling the estimated syllable data, which, however, is impractical. In contrast, in this work we propose a method to obtain labels for estimated syllable data without using manual annotation. The proposed method considers a weighted version of the well-known Wagner–Fisher algorithm (WFA) to assign the existing labels to the estimated syllable data, where the weights are computed based on a set of three constraints defined in the proposed algorithm. Experiments on the ISLE corpus show that the performance obtained on the test set for four different types of estimated syllable data is higher when the assigned labels and estimated syllable data are used for training compared to when existing labels and existing syllable data are used. Also, the label assignment accuracy using the proposed method is found to be higher than that using a baseline scheme based on WFA.
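The sketch below illustrates the general shape of a weighted Wagner–Fisher alignment used to carry existing labels over to an estimated syllable sequence; the syllable representation (durations), the cost weights, and the tie-breaking are placeholders, not the three constraints defined in the paper.

```python
import numpy as np

def assign_labels(existing_syls, labels, estimated_syls, w_sub=1.0, w_ins=1.0, w_del=1.0):
    """Illustrative weighted Wagner-Fisher alignment between existing and
    estimated syllable sequences (represented here by their durations), used to
    carry the existing stress labels over to the estimated syllables."""
    n, m = len(existing_syls), len(estimated_syls)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = np.arange(n + 1) * w_del
    D[0, :] = np.arange(m + 1) * w_ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = w_sub * abs(existing_syls[i - 1] - estimated_syls[j - 1])
            D[i, j] = min(D[i - 1, j - 1] + sub, D[i - 1, j] + w_del, D[i, j - 1] + w_ins)

    # Trace back and copy each existing label onto the estimated syllable it aligns to;
    # unaligned (inserted) estimated syllables keep the placeholder value None.
    assigned = [None] * m
    i, j = n, m
    while i > 0 and j > 0:
        sub = w_sub * abs(existing_syls[i - 1] - estimated_syls[j - 1])
        if np.isclose(D[i, j], D[i - 1, j - 1] + sub):
            assigned[j - 1] = labels[i - 1]
            i, j = i - 1, j - 1
        elif np.isclose(D[i, j], D[i - 1, j] + w_del):
            i -= 1
        else:
            j -= 1
    return assigned

# Toy usage: 3 existing syllables (durations in seconds) vs. 4 estimated ones.
print(assign_labels([0.20, 0.35, 0.15], [0, 1, 0], [0.18, 0.10, 0.30, 0.16]))
```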
Noise robust pitch stylization using minimum mean absolute error criterion
Chiranjeevi Yarra,Prasanta Kumar Ghosh
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2021
@inproceedings{bib_Nois_2021, AUTHOR = {Chiranjeevi Yarra, Prasanta Kumar Ghosh}, TITLE = {Noise robust pitch stylization using minimum mean absolute error criterion}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2021}}
We propose a pitch stylization technique in the presence of pitch halving and doubling errors. The technique uses an optimization criterion based on a minimum mean absolute error to make the stylization robust to such pitch estimation errors, particularly under noisy conditions. We obtain segments for the stylization automatically using dynamic programming. Experiments are performed at the frame level and the syllable level. At the frame level, the closeness of the stylized pitch to the ground truth pitch, obtained from a laryngograph signal, is analyzed using the root mean square error (RMSE) measure. At the syllable level, the effectiveness of perceptually relevant embeddings in the stylized pitch is analyzed by estimating syllabic tones and comparing them with manual tone markings using the Levenshtein distance measure. The proposed approach performs better than a minimum mean squared error criterion based pitch stylization scheme at the frame level and a knowledge-based tone estimation scheme at the syllable level under clean and 20 dB, 10 dB and 0 dB SNR conditions with five noises and four pitch estimation techniques. Among all the combinations of SNR, noise and pitch estimation techniques, the highest absolute RMSE and mean distance improvements are found to be 6.49 Hz and 0.23, respectively. Index Terms: Pitch stylization, minimum MAE criterion, dynamic programming based segmentation, noise robustness
DNN based phrase boundary detection using knowledge-based features and feature representations from CNN
Pavan Kumar J,Chiranjeevi Yarra,Prasanta Kumar Ghosh
National Conference on Communications, NCC, 2021
@inproceedings{bib_DNN__2021, AUTHOR = {Pavan Kumar J, Chiranjeevi Yarra, Prasanta Kumar Ghosh}, TITLE = {DNN based phrase boundary detection using knowledge-based features and feature representations from CNN}, BOOKTITLE = {National Conference on Communications}. YEAR = {2021}}
Automatic phrase boundary detection could be useful in applications including computer-assisted pronunciation tutoring, spoken language understanding, and automatic speech recognition. In this work, we consider the problem of phrase boundary detection on English utterances spoken by native American speakers. Most of the existing works on boundary detection use either knowledge-based features or representations learnt from a convolutional neural network (CNN) based architecture, considering word segments. However, we hypothesize that combining knowledge-based features and learned representations could improve the performance of the boundary detection task. For this, we consider a fusion-based model combining a deep neural network (DNN) and CNNs, where the CNNs are used for learning representations and the DNN is used to combine knowledge-based features and the learned representations. Further, unlike existing data-driven methods, we consider two CNNs for representation learning, one for word segments and another for word-final syllable segments. Experiments on the Boston University radio news and Switchboard corpora show the benefit of the proposed fusion-based approach compared to a baseline using knowledge-based features only and another baseline using feature representations from a CNN only. Index Terms: Boundary detection, human-computer interaction, computer-assisted pronunciation tutoring, CNN based representation learning
MUCS 2021: Multilingual and code-switching ASR challenges for low resource Indian languages
Anuj Diwan,Rakesh Vaideeswaran,Sanket Shah,Ankita Singh,Srinivasa Raghavan,Shreya Khare,Vinit Unni,Akash Rajpuria,Chiranjeevi Yarra
Annual Conference of the International Speech Communication Association, INTERSPEECH, 2021
@inproceedings{bib_MUCS_2021, AUTHOR = {Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan, Shreya Khare, Vinit Unni, Akash Rajpuria, Chiranjeevi Yarra}, TITLE = {MUCS 2021: Multilingual and code-switching ASR challenges for low resource Indian languages}, BOOKTITLE = {Annual Conference of the International Speech Communication Association}. YEAR = {2021}}
Recently, there is an increasing interest in multilingual automatic speech recognition (ASR), where a speech recognition system caters to multiple low resource languages by taking advantage of low amounts of labelled corpora in multiple languages. With multilingualism becoming common in today's world, there has been increasing interest in code-switching ASR as well. In code-switching, multiple languages are freely interchanged within a single sentence or between sentences. The success of low-resource multilingual and code-switching (MUCS) ASR often depends on the variety of languages in terms of their acoustics and linguistic characteristics, as well as the amount of data available and how these are carefully considered in building the ASR system. In this MUCS 2021 challenge, we focus on building MUCS ASR systems through two different subtasks related to a total of seven Indian languages, namely Hindi, Marathi, Odia, Tamil, Telugu, Gujarati and Bengali. For this purpose, we provide a total of ∼600 hours of transcribed speech data, comprising train and test sets, in these languages, including two code-switched language pairs, Hindi-English and Bengali-English. We also provide baseline recipes for both the subtasks, with 30.73% and 32.45% word error rate on the MUCS test sets, respectively. Index Terms: Multilingual, Code-switching, low-resource
A STUDY ON NATIVE AMERICAN ENGLISH SPEECH RECOGNITION BY INDIAN LISTENERS WITH VARYING WORD FAMILIARITY LEVEL
Abhayjeet Singh,Achuth Rao MV,Rakesh Vaideeswaran,Chiranjeevi Yarra,Prasanta Kumar Ghosh
Conference of the Oriental COCOSDA, O-COCOSDA, 2021
@inproceedings{bib_A_ST_2021, AUTHOR = {Abhayjeet Singh, Achuth Rao MV, Rakesh Vaideeswaran, Chiranjeevi Yarra, Prasanta Kumar Ghosh}, TITLE = {A STUDY ON NATIVE AMERICAN ENGLISH SPEECH RECOGNITION BY INDIAN LISTENERS WITH VARYING WORD FAMILIARITY LEVEL}, BOOKTITLE = {Conference of the Oriental COCOSDA}. YEAR = {2021}}
In this study, listeners of varied Indian nativities are asked to listen to and recognize TIMIT utterances spoken by American speakers. We have three kinds of responses from each listener while they recognize an utterance: 1. sentence difficulty ratings, 2. speaker difficulty ratings, and 3. transcription of the utterance. From these transcriptions, the word error rate (WER) is calculated and used as a metric to evaluate the similarity between the recognized and the original sentences. The sentences selected in this study are categorized into three groups: Easy, Medium and Hard, based on the frequency of occurrence of the words in them. We observe that the sentence difficulty ratings, speaker difficulty ratings and the WERs increase from the easy to the hard category of sentences. We also compare the human speech recognition (HSR) performance with that of three automatic speech recognition (ASR) systems under the following three combinations of acoustic model (AM) and language model (LM): ASR1) AM trained with recordings from speakers of Indian origin and LM built on TIMIT text, ASR2) AM using recordings from native American speakers and LM built on text from the LIBRI speech corpus, and ASR3) AM using recordings from native American speakers and LM built on LIBRI speech and TIMIT text. We observe that HSR performance is similar to that of ASR1, whereas ASR3 achieves the best performance. Speaker nativity wise analysis shows that utterances from speakers of some nativities are more difficult for Indian listeners to recognize compared to a few other nativities. Index Terms: human speech recognition, automatic speech recognition
IE-CPS Lexicon: An Automatic Speech Recognition Oriented Indian English Pronunciation Dictionary
Shelly Jain,Aditya Yadavalli,Mirishkar Sai Ganesh,Chiranjeevi Yarra,Anil Kumar Vuppala
International Conference on Natural Language Processing., ICON, 2021
@inproceedings{bib_IE-C_2021, AUTHOR = {Shelly Jain, Aditya Yadavalli, Mirishkar Sai Ganesh, Chiranjeevi Yarra, Anil Kumar Vuppala}, TITLE = {IE-CPS Lexicon: An Automatic Speech Recognition Oriented Indian English Pronunciation Dictionary}, BOOKTITLE = {International Conference on Natural Language Processing.}. YEAR = {2021}}
Indian English (IE), on the surface, seems quite similar to standard English. However, closer observation shows that it has actually been influenced by the surrounding vernacular languages at several levels, from phonology to vocabulary and syntax. Due to this, automatic speech recognition (ASR) systems developed for American or British varieties of English result in poor performance on Indian English data. The most prominent feature of Indian English is the characteristic pronunciation of the speakers. The systems are unable to learn these acoustic variations while modelling and cannot parse the non-standard articulation of non-native speakers. For this purpose, we propose a new phone dictionary developed based on the Indian language Common Phone Set (CPS). The dictionary maps the phone set of American English to existing Indian phones based on perceptual similarity. This dictionary is named the Indian English Common Phone Set (IE-CPS). Using this, we build an Indian English ASR system and compare its performance with an American English ASR system on speech data of both varieties of English. Our experiments on the IE-CPS show that it is quite effective at modelling the pronunciation of the average speaker of Indian English. ASR systems trained on Indian English data perform much better when modelled using IE-CPS, achieving a reduction in word error rate (WER) of up to 3.95% when used in place of CMUdict. This shows the need for a different lexicon for Indian English.
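To make the lexicon-mapping idea concrete, the snippet below rewrites a CMUdict-style entry through a small phone substitution table; the mapping entries shown are purely hypothetical placeholders, not the actual IE-CPS table from the paper.

```python
# Hypothetical mapping from a few CMUdict (ARPAbet) phones to Indian Common
# Phone Set style symbols; treat these entries purely as placeholders.
ARPABET_TO_IECPS = {"DH": "d̪", "TH": "t̪ʰ", "T": "ʈ", "D": "ɖ", "W": "ʋ", "V": "ʋ"}

def convert_entry(word, arpabet_phones):
    """Rewrite one CMUdict entry with the mapped phone symbols, dropping the
    lexical-stress digits that ARPAbet vowels carry (e.g. AH0 -> AH)."""
    mapped = []
    for p in arpabet_phones:
        base = p.rstrip("012")                  # strip stress marker if present
        mapped.append(ARPABET_TO_IECPS.get(base, base))
    return word, mapped

print(convert_entry("THIRTY", ["TH", "ER1", "T", "IY0"]))
# -> ('THIRTY', ['t̪ʰ', 'ER', 'ʈ', 'IY'])
```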
A Robust Speaking Rate Estimator Using a CNN-BLSTM Network
Aparna Srinivasan,Diviya Singh,Chiranjeevi Yarra,Aravind Illa,Prasanta Kumar Ghosh
Circuits, Systems, and Signal Processing, CSSP, 2021
@inproceedings{bib_A_Ro_2021, AUTHOR = {Aparna Srinivasan, Diviya Singh, Chiranjeevi Yarra, Aravind Illa, Prasanta Kumar Ghosh}, TITLE = {A Robust Speaking Rate Estimator Using a CNN-BLSTM Network}, BOOKTITLE = {Circuits, Systems, and Signal Processing}. YEAR = {2021}}
Direct acoustic feature-based speaking rate estimation is useful in applications including pronunciation assessment, dysarthria detection and automatic speech recognition. Most of the existing works on speaking rate estimation have steps which are heuristically designed. In contrast to the existing works, in this work a data-driven approach with a convolutional neural network-bidirectional long short-term memory (CNN-BLSTM) network is proposed to jointly optimize all steps in speaking rate estimation through a single framework. Also, unlike existing deep learning-based methods for speaking rate estimation, the proposed approach estimates the speaking rate for an entire speech utterance in one go instead of considering segments of a fixed duration. We consider the traditional 19 sub-band energy (SBE) contours as the low-level input features of the proposed CNN-BLSTM network. The state-of-the-art direct acoustic feature-based speaking rate estimation techniques are developed based on 19 SBEs as well. Experiments are performed separately using three native English speech corpora (Switchboard, TIMIT and CTIMIT) and a non-native English speech corpus (ISLE). Among these, TIMIT and Switchboard are used for training the network. However, testing is carried out on all four corpora as well as on TIMIT and Switchboard with additive noise, namely white, car, high-frequency-channel, cockpit, and babble at 20, 10 and 0 dB signal-to-noise ratios. The proposed CNN-BLSTM approach outperforms the best of the existing techniques in clean as well as noisy conditions for all four corpora.
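A minimal sketch of such a CNN-BLSTM regressor over 19 sub-band energy contours is shown below; the layer sizes, pooling, and frame shift are illustrative assumptions rather than the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class SpeakingRateCNNBLSTM(nn.Module):
    """Illustrative CNN-BLSTM regressor from 19 sub-band energy contours to a
    single speaking-rate value per utterance; layer sizes are placeholders."""

    def __init__(self, n_bands=19, conv_ch=32, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_bands, conv_ch, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(conv_ch, conv_ch, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.blstm = nn.LSTM(conv_ch, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, sbe):                  # sbe: (batch, 19, num_frames)
        h = self.conv(sbe).transpose(1, 2)   # -> (batch, num_frames, conv_ch)
        h, _ = self.blstm(h)
        return self.head(h.mean(dim=1)).squeeze(-1)   # one rate per utterance

# Toy usage: a 3 s utterance at a 10 ms frame shift -> 300 frames.
model = SpeakingRateCNNBLSTM()
print(model(torch.randn(2, 19, 300)).shape)   # torch.Size([2])
```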