Abstract
This doctoral research presents a comprehensive investigation into automated table understanding, arguing that robust, reliable solutions demand an approach that evolves from foundational structural parsing toward the pressing real-world requirements of data privacy and auditable
reasoning. Tables are information-rich, structured objects that serve as a cornerstone for conveying
complex data, yet their automated parsing is a formidable, long-standing challenge in document intelligence. The core of this challenge lies in Table Structure Recognition (TSR), the process of transforming
a table image into a structured, machine-readable format. The difficulty is rooted in the immense visual diversity of tables, with complexities such as spanning cells, multi-line text, and the absence of
ruling lines often causing traditional and early deep learning methods to fail. This body of work charts
a clear research trajectory that begins with the development of a novel framework for TSR, TabStruct-Net, and progressively refines this methodology to handle increasing visual complexity. The research
then pivots to address critical non-functional requirements, pioneering TabGuard, a novel framework for
privacy-preserving TSR, and finally extends its scope from structure to trusted reasoning by introducing EviFiVQA, a benchmark for financial Visual Question Answering (VQA) that establishes evidence
localization as a core tenet of auditable AI. This journey from pixels to privacy and proof marks a significant contribution to the field.
The core methodology advanced throughout this research is anchored in a powerful two-step paradigm
that mirrors human cognitive processes: a top-down decomposition followed by a bottom-up reconstruction. In the top-down phase, the table image is decomposed into its fundamental constituent parts—the
individual table cells—through an object detection model. In the bottom-up phase, the global table
structure is reconstructed by learning the spatial and logical associations between the detected cells. A
cornerstone of this research is the novel insight that TSR performance can be dramatically improved by
encoding human intuition about table structure directly into the learning objective. This was achieved
through a series of innovative, cognitively inspired loss functions that act as structural regularizers, including an Alignment Loss to enforce a grid-like structure, a Continuity Loss to ensure adjacent cell
boundaries are contiguous, and an Overlapping Loss to penalize spatial conflicts. This approach is
marked by a clear architectural evolution, beginning with TabStruct-Net, which combined a modified
Mask R-CNN with a Dynamic Graph Convolutional Neural Network, and culminating in TabStruct-Net V2, which introduced a Hierarchical Local-Attention Vision Transformer (HLVIT) backbone and a highly efficient self-attention layer to achieve state-of-the-art performance and scalability.

Building upon the foundation of accurate structural recognition, the research pivots to address a critical
real-world barrier to the adoption of document AI: data privacy. The research introduces TabGuard, a
pioneering framework that enables privacy-preserving TSR through a novel client-server architecture.
In this model, the client performs content masking locally, blacking out all text regions before sending
only the anonymized image to the server for structure recognition. This effectively decouples structural
understanding from content recognition, ensuring sensitive data is never exposed. The key technical
innovation that makes this possible is the Table Grid Approximator (TGA), a server-side module that
analyzes the spatial distribution of masked text contours to infer a coarse structural grid. This grid serves
as a powerful inductive bias, allowing for the dynamic generation of high-quality, layout-specific anchor boxes that enable the detection network to converge accurately even without access to text content.
The success of TabGuard provides compelling evidence that a table’s structure can be robustly inferred
purely from its spatial layout cues, independent of its content.
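The idea behind the TGA can be sketched as follows: because the masked image still exposes the geometry of the blacked-out text regions, clustering their coordinates recovers a coarse row/column grid. This is a simplified Python illustration of that principle, not the dissertation's implementation; the function names and the gap threshold are assumptions.

```python
def cluster_1d(values, gap):
    """Group sorted coordinates into clusters separated by more than
    `gap` pixels; return the mean of each cluster as its position."""
    vals = sorted(values)
    clusters, current = [], [vals[0]]
    for v in vals[1:]:
        if v - current[-1] > gap:
            clusters.append(current)
            current = []
        current.append(v)
    clusters.append(current)
    return [sum(c) / len(c) for c in clusters]

def approximate_grid(mask_boxes, gap=15):
    """Infer coarse column and row positions from masked text boxes
    (x0, y0, x1, y1) -- the content itself is never seen."""
    cols = cluster_1d([b[0] for b in mask_boxes], gap)
    rows = cluster_1d([b[1] for b in mask_boxes], gap)
    return rows, cols

# Masked boxes from a 2-row x 3-column layout: only geometry remains.
boxes = [(10, 10, 60, 30), (110, 12, 160, 32), (210, 11, 260, 31),
         (12, 60, 62, 80), (112, 62, 162, 82), (212, 61, 262, 81)]
rows, cols = approximate_grid(boxes)
# Recovers 2 row positions and 3 column positions.
```

A grid approximated this way can then seed layout-specific anchor boxes for the detection network, as described above.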
The final thrust of this research extends beyond structural parsing to the higher-level goal of building
trustworthy AI systems for semantic document understanding, explored in the high-stakes domain of
financial analysis. To address a critical gap in existing VQA benchmarks, which lack support for the complex numerical and hierarchical reasoning required by financial documents, this research created the EviFiVQA benchmark. Its most significant contribution is the requirement of evidence localization: for
each question, a model must produce not only the correct answer but also the bounding boxes of all
table cells used to compute it. This feature transforms VQA from a black-box task into a transparent
and interpretable one, providing a direct mechanism for auditability by allowing an ana