DIA-Tribe Project Summary

This project aims to build a fully open experimental environment for both academic and corporate research in Document Image Analysis (DIA). It will provide both the technological and the methodological tools needed to make experimental research in this domain much more reproducible than it is now. Document Image Analysis is a discipline that focuses on extracting high-level information from unstructured digital documents. These documents are generally obtained either through controlled scanning processes or through uncontrolled capture with hand-held devices, and may even be extracted from video. DIA lies at the crossroads of Signal and Image Processing, Knowledge Modeling, Information Retrieval and Machine Learning. Its main application domains are defined by corporate or societal needs to extract, correlate and use information embedded in complex flows of data, part of which comes from uncontrollable and difficult-to-model sources.

One of the major shortcomings of the current state of the art is that end-to-end processing chains have been left to the economic and industrial stakeholders. There is currently no open academic reference framework available to elaborate or test such chains. The reference sets and benchmarking tools that do exist focus only on specific and contextually constrained sub-problems. The specifics of the collected data set we leverage in this proposal, together with the framework we will develop, will provide an open academic environment that corrects this bias. Public funding, free of potential industrial or corporate conflicts of interest, is essential to this goal, since private stakeholders have no interest in funding research initiatives that could expose their strengths or weaknesses to the broader community.

Besides investigating the technological and methodological tools needed to make experimental research in this domain much more reproducible, and building upon an international community to drive large-scale adoption and impact in the DIA domain, the project will also provide insight into how the experimental performance evaluation of machine perception needs to be done in order to be fully reproducible. As such, it will set the stage for a broader application of its findings in other domains.

The main motivation behind this is that the current state of the art in DIA, and its associated research community, struggle with a genuine difficulty in correctly assessing their progress as a scientific discipline, even though the technical achievements of the last decades unmistakably show great improvements and have yielded production-ready tools for large-scale automated document processing. This difficulty essentially comes from the fact that most DIA research is application driven. The major advantage of being application driven is that research stays in phase with the needs of the digital document processing industry. The drawback is that scientifically objective comparison between published results is not always straightforward, for two main reasons. First, very focused problem solving leads to segmented niche sub-foci that barely share any context with other subjects. Second, the data collections required for open peer validation are very often impossible to obtain, since they belong to economic stakeholders that cannot release them easily (for IP or competitive reasons, or because they contain privacy-sensitive data). This research proposal aims to create the French roots of a future international consortium that will address the following issues:

  • provide a long-term, sustainable framework for benchmarking and evaluating DIA methods,
  • process, store and distribute large reference data sets for DIA evaluation,
  • provide tools to annotate, extend and organize DIA data sets for use as benchmarks,
  • provide automated tools to classify document collections and extract information for indexing and searching,
  • organize international scientific events to promote the use and extend the scope of the benchmarking tools to the broader DIA community.