Abstract
Automatic Speech Recognition (ASR) has witnessed rapid progress with the advent of self-supervised
learning and large-scale multilingual models. However, ASR for morphologically complex and low-resource languages remains challenging due to limited labeled data, orthographic inconsistencies, tokenization difficulties, and the computational demands of large-scale systems. Perso-Arabic languages
such as Persian, Arabic, and Urdu exemplify these difficulties: script variations, rich morphology, complex word formation, and inconsistent text normalization complicate both preprocessing and
subword modeling for ASR systems.
Current state-of-the-art approaches largely rely on scaling model size and data volume, often overlooking the importance of linguistic structure, script-aware tokenization, and domain relevance. For
low-resource settings, scale-driven strategies are computationally prohibitive and frequently suboptimal.
This thesis investigates whether efficient and competitive ASR performance can instead be achieved
through linguistically informed preprocessing, morphologically aware tokenization, and targeted cross-lingual continual pretraining, without dependence on excessively large models.
We first introduce a unified script-aware preprocessing pipeline tailored to Perso-Arabic languages,
addressing normalization, tokenization, phonetic parsing, and text cleaning to reduce orthographic noise
and improve data consistency. Building upon this foundation, we construct a scalable multilingual unlabeled speech corpus of approximately 3,000 hours across Persian, Arabic, and Urdu. Using this corpus,
we systematically evaluate continual pretraining strategies under different initialization conditions and
integrate SentencePiece-based subword modeling within a CTC-based Wav2Vec framework, carefully
adapting token segmentation to handle word boundaries and repeated character collapse. All downstream fine-tuning and evaluation are conducted in a monolingual setting to isolate transfer effects.
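As a minimal illustration of two steps named above, script-aware normalization and subword decoding under CTC, the following Python sketch may help. It is not the pipeline used in this thesis: the function names, the `<blank>` symbol, and the small character map (Arabic yeh and kaf folded to their Persian forms, tatweel and diacritics removed) are assumptions chosen for illustration. The only convention taken from SentencePiece is its word-boundary marker, U+2581.

```python
import unicodedata

# Hypothetical, Persian-oriented character map (illustrative only):
# fold Arabic code points into the forms conventionally used in Persian text.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH  -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF  -> ARABIC LETTER KEHEH
    "\u0640": None,      # ARABIC TATWEEL (elongation mark) is dropped
})

BLANK = "<blank>"          # assumed name for the CTC blank symbol
WORD_BOUNDARY = "\u2581"   # SentencePiece word-boundary marker


def normalize(text):
    """Reduce orthographic variation before tokenization (sketch)."""
    # NFKC folds compatibility forms such as Arabic presentation forms.
    text = unicodedata.normalize("NFKC", text)
    # Drop nonspacing marks (category Mn), which include the short-vowel diacritics.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return text.translate(CHAR_MAP)


def ctc_greedy_decode(frame_tokens):
    """Collapse a per-frame sequence of subword pieces into text (sketch).

    CTC merges consecutive identical tokens, so a genuinely repeated
    piece survives only when a blank separates its two occurrences.
    """
    pieces = []
    prev = None
    for tok in frame_tokens:
        if tok != prev and tok != BLANK:
            pieces.append(tok)
        prev = tok
    # The boundary marker opens each word, so it maps back to a space.
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


if __name__ == "__main__":
    frames = ["\u2581sa", "\u2581sa", BLANK, "l", BLANK, "l", "am",
              "\u2581do", "\u2581do", "st"]
    print(ctc_greedy_decode(frames))
```

In this sketch the two `l` pieces are both kept only because a blank frame separates them; without that blank, the collapse rule would merge them into one. This is the interaction between token segmentation and repeated-character collapse that the tokenizer adaptation has to account for.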
Our results demonstrate that a 300M-parameter model, when adapted through targeted continual
pretraining and morphologically aware tokenization, achieves performance competitive with systems
over five times larger. Notably, the proposed model outperforms Whisper Large v3 on Persian and yields
strong results on Arabic and Urdu despite using substantially fewer parameters and labeled resources.
These findings challenge the prevailing assumption that ASR quality scales primarily with model
size. Instead, we show that data relevance, adaptation strategy, and linguistically informed tokenization
play a more critical role in low-resource scenarios. This thesis provides a practical and resource-efficient
pathway toward competitive ASR for Perso-Arabic languages and offers a reproducible experimental
template that may inform future multilingual extensions.