Joint Research Efforts Aim To Make Document Image Analysis More Fluent And Interactive

Researchers at the Centre for Visual Information Technology (CVIT), IIITH, in collaboration with the Computer Vision Centre, Barcelona, have been opening new frontiers in the visual question answering domain through a series of projects and challenges.

While technology has certainly evolved over the years, there is still no easy way for computers to read and recognize information presented as an image, such as a scanned document or a picture of a street sign. In fact, that explains why a huge chunk of spam email consists of visual ads: spammers typically embed a bit of text inside images in a bid to circumvent spam-blocking software.

OCR
The late ‘60s and early ‘70s witnessed the introduction of systems that could recognise images of text in any font – handwritten, printed or typed. This sort of document image analysis, more popularly known as optical character recognition (OCR), is one of the oldest application fields of artificial intelligence, combining both vision and language understanding. A lot of current research in document image analysis is motivated by it, but so far efforts have focused on creating models that extract information from images in a “bottom-up” approach. That is, whether they are detecting tables and diagrams or extracting handwritten fields from predefined forms, OCR engines are disconnected from the final use of the information. Not only are most of these systems designed to work offline with little to no human interaction, they also tend to focus only on ‘conversion’ of documents into digital formats, rather than on really ‘understanding’ the message the documents contain. For instance, OCR technology is used at airports to automatically ‘read’ passport information, but it cannot interact with an immigration official who may want more complex information extracted from the passport.
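To make the “conversion, not understanding” point concrete, here is a minimal sketch of what a bottom-up OCR step typically looks like in code. It uses the open-source Tesseract engine via the pytesseract library – an illustrative choice, not a tool named by the researchers – and the input file name is hypothetical. The engine returns raw text; deciding what that text means is left entirely to whatever consumes it.

```python
# Minimal sketch of bottom-up OCR: convert a page image into plain text.
# Assumes pytesseract and the Tesseract engine are installed; the file name
# below is purely illustrative.
from PIL import Image
import pytesseract

def convert_document(path: str) -> str:
    """Turn a scanned page into plain text -- conversion, not comprehension."""
    page = Image.open(path)
    return pytesseract.image_to_string(page)

if __name__ == "__main__":
    text = convert_document("scanned_passport_page.png")  # hypothetical input
    print(text)  # downstream code must still work out what any of it means
```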

Collaborative Efforts
“As a community, we are now looking for a superior understanding beyond just recognition,” says Prof. Jawahar. In a systematic step towards this, with immediate applications, a collaborative effort has been underway between the Centre for Visual Information Technology (CVIT), IIITH and the Computer Vision Centre (CVC) at the Autonomous University of Barcelona, with support from an Amazon AWS Machine Learning Award. The team, comprising Prof. C. V. Jawahar, head of CVIT, researcher Minesh Mathew, and Dr. Dimosthenis Karatzas, Associate Director of CVC, has been engaged in creating a system that can initiate a dialogue with different forms of written text, such as that in a document, a book, an annual report, a comic strip and so on. Known as Document Visual Question Answering, the research here is centred on guiding machines to understand human requests and respond to them appropriately, eventually in real time. Questions could range from the very simple, such as asking the system to identify what is in the image (for instance, is it a person, an animal, a food item, and so on), to the more complex – identifying persons in it (if it is a celebrity, the system is asked to identify who the celebrity is) and what is happening in the image.
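For readers who want a feel for what such a question-answering interaction looks like in practice, the sketch below uses the Hugging Face “document-question-answering” pipeline with a publicly available model (impira/layoutlm-document-qa). This is an illustrative setup under those assumptions, not the system built by the CVIT/CVC team, and the image path and question are made up.

```python
# Minimal sketch: ask a natural-language question directly against a document
# image. Assumes the transformers library (plus its OCR dependencies) and the
# public impira/layoutlm-document-qa checkpoint; both are illustrative choices.
from transformers import pipeline

doc_qa = pipeline(
    "document-question-answering",
    model="impira/layoutlm-document-qa",
)

result = doc_qa(
    image="annual_report_page.png",              # hypothetical scanned page
    question="What was the total revenue in 2019?",
)
# The pipeline returns a ranked list of candidate answers with scores.
print(result[0]["answer"], result[0]["score"])
```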

DocVQA Challenge Series
According to Dr. Karatzas, the long-term aim is to push for a change of paradigm in the document analysis and recognition community and to come up with new ways of doing things. “The idea is to use methods that condition the information extraction on the high-level task defined by the user in the form of a natural language question, while maintaining a human-friendly interface,” he says. The first steps towards this goal have been to frame the problem correctly and to define data and benchmarks in order to measure progress. The DocVQA Challenge Series (https://rrc.cvc.uab.es/?ch=17) was born as a result of the first year of work.

So far, the team has set up three challenges that look at increasingly difficult facets of the problem. They began by defining a large-scale, broad dataset of 12,000 document images with 50,000 question-and-answer pairs. When the dataset was released, the first challenge was organized prior to the 2020 Conference on Computer Vision and Pattern Recognition (CVPR), covering a wide range of question types. Next, the researchers moved to asking questions over a set of document images – a whole collection as opposed to a single one. This year’s DocVQA challenge, which is currently underway, introduces an even more challenging task – VQA on infographics, where textual information is intrinsically linked with graphical elements to create complex layouts telling stories based on data. For this, a new dataset with an emphasis on data visualizations has been created.

“The DocVQA web portal is quickly becoming the de-facto benchmark for this task, as researchers use it daily to evaluate new ideas, models and methods. To date, we have evaluated ~700 submissions to the first two challenges, out of which more than 60 have been made public by their authors and feature in the ranking tables,” says Dr. Karatzas. In the context of the upcoming International Conference on Document Analysis and Recognition (ICDAR) 2021, researchers involved in this project are organising two workshops: one on ‘Document Visual Question Answering’ (1st edition, http://docvqa.org/static/workshop.html) and another on ‘Human Document Interaction’ (3rd edition, https://grce.labri.fr/HDI/index.html).
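Ranking hundreds of submissions requires an automatic way to compare a predicted answer string with the accepted ones. The sketch below is a simplified, unofficial illustration of ANLS-style scoring (Average Normalized Levenshtein Similarity), the kind of edit-distance-based measure commonly used for DocVQA-style evaluation; the official scoring lives on the challenge portal, and the metric choice here is an assumption on our part.

```python
# Simplified, unofficial sketch of ANLS-style scoring for short text answers.
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance computed with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(predictions, gold_answers, threshold=0.5):
    """Average, over questions, of the best similarity to any accepted answer."""
    scores = []
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            sim = 1 - levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, sim if sim >= threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores)

# Tiny usage example with one question and two accepted answer variants.
print(anls(["12,000 document images"],
           [["12,000 document images", "12000 images"]]))  # -> 1.0
```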

Everyday AI
Speaking of the multidisciplinary nature of current-day AI, Prof. Jawahar says, “Computer Vision, Natural Language Processing and Speech Processing are converging. Problems are multi-modal and targets are closer to human perception.” He goes on to add, “The day is not too far when one can use the mobile phone camera as a tool to interact with the world. You could point it at a plate of food, enquire in your own voice if it’s gluten-free or keto-friendly and, based on the response elicited, determine if it’s safe to eat!”