Abstract
India’s rich linguistic diversity, encompassing 22 scheduled languages and thousands of regional varieties,
poses significant communication challenges across domains such as education, healthcare, governance,
and digital access. Addressing this linguistic divide is crucial for fostering social and economic inclusion.
Speech-to-Speech Machine Translation (SSMT) offers a promising solution, enabling seamless multilingual
communication. However, developing SSMT for Indian languages is challenging due to the scarcity of
corpora and the complexity of linguistic structures.
This work adopts a cascaded SSMT approach that integrates automatic speech recognition (ASR), machine
translation (MT), and text-to-speech (TTS), with a focus on mitigating error propagation between components
to improve translation quality. The primary emphasis is on the machine translation component, while also
addressing critical gaps between ASR output and MT input, such as missing punctuation, spoken disfluencies,
and grammatical inconsistencies. We also ensure
effective integration with text-to-speech systems by automatically generating subtitle text that aligns with
video/audio content based on the source language.
To achieve these goals, we propose and develop a speech-to-speech (video-to-video) translation pipeline
that incorporates several pre-processing and post-processing tools to bridge these gaps, including punctuation
restoration, disfluency detection, domain and domain-term identification, automatic translated-subtitle
generation, and video syncing, supporting both English and Indian languages. To improve MT quality
across English and multiple Indian languages, we explored linguistic features
such as part-of-speech (POS) tags, chunk tags, and morphology, with a particular emphasis on low-resource
languages. We addressed the scarcity of parallel corpora for 36×36 translation directions, spanning English and
35 Indian languages, by employing advanced data augmentation techniques such as iterative back-translation,
discourse back-translation, COMET-based filtering, and pivot translation, resulting in the creation of over 10
billion parallel sentences. Additionally, we developed domain-specific human-translated parallel corpora
tailored to the general, educational, and healthcare domains. We also addressed translation evaluation by creating
millions of annotated instances for direct machine translation evaluation and using them to develop both
reference-free and reference-based evaluation metrics, leveraging encoder-decoder and decoder-only models. We also explored
error identification and automatic post-editing as tasks related to MT. Our work further encompasses subtitle
translation and video-to-video synchronization, ensuring precise alignment between translated subtitles and
video frames.
A significant contribution of this work is the development of a single multitask model with 2.5 billion
parameters, integrating translation, discourse machine translation, evaluation, error detection, and post-editing
functionalities. Additionally, we designed a decoder-only large language model tailored for English and 35
low-resource languages as a foundational model, which we further fine-tuned for MT and related tasks. We
developed benchmark datasets for SSMT, disfluency translation,
discourse machine translation, and domain-specific machine translation. Through these efforts, we present a
scalable and robust SSMT system that fosters linguistic inclusivity and bridges communication gaps across
India’s diverse languages. All our contributions, including corpora, models, and related resources, are made
openly available to the community to facilitate further research and innovation.