Abstract
With over 1.4 billion people and dozens of widely used scripts, India represents one of the world's most linguistically and visually diverse document ecosystems. Yet document understanding benchmarks remain largely Western- and English-centric, so the robustness of modern vision-language models (VLMs) in multilingual and low-resource settings remains under-evaluated. We introduce the first large-scale multilingual benchmark for Indian document understanding, comprising 52,607 documents across 23 languages, 13 domains, 71 document classes, and 52 tasks, with over 2.4M QA annotations. The dataset captures diverse layouts, scripts, and visual conditions spanning governance, education, finance, archival, and informal content, among others. We additionally include synthetic multilingual charts, tables, and diagrams generated via Patram-Syn, an Indian persona-driven data generation pipeline with human validation. The benchmark measures perception-level capabilities such as OCR, parsing, key information extraction, and layout detection, as well as reasoning-level capabilities including Document VQA with abstractive, multi-hop, and ambiguous questions. We further introduce targeted settings for cross-lingual, transliterated, and code-mixed VQA. We also propose two new evaluation metrics, the Document Understanding Cross-Lingual Index (DUCLI) and the Script Robustness Score (SRS), which quantify cross-lingual reasoning degradation and script robustness, respectively. Our benchmark exposes consistent weaknesses in both open-source and closed-source VLMs, particularly in low-resource languages, degraded scans, and layout-intensive settings, underscoring fundamental challenges in scalable multilingual document intelligence.