3260 papers • 126 benchmarks • 313 datasets
Document image classification is the task of classifying documents based on images of their contents. (Image credit: Real-Time Document Image Classification using Deep CNN and Extreme Learning Machines)
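To make the task concrete, here is a minimal, purely illustrative sketch: synthetic 32x32 "pages" of two classes (letter-like pages with ink concentrated near the top, form-like pages with ink spread evenly), classified with a simple logistic model over crude layout features. The data, features, and model are all toy stand-ins, not the method of any paper listed below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for document images: 32x32 grayscale pages (1.0 = white).
# "Letters" (class 0) have ink concentrated in the upper half;
# "forms" (class 1) have ink spread evenly down the page.
def make_page(kind):
    page = np.ones((32, 32))
    if kind == 0:                       # letter: ink near the top
        page[4:16] -= rng.random((12, 32)) * 0.8
    else:                               # form: ink spread over the page
        page[::4] -= rng.random((8, 32)) * 0.8
    return page

X = np.stack([make_page(k) for k in (0, 1) * 50])
y = np.array([0, 1] * 50)

# Feature: mean ink per horizontal band (a crude layout descriptor).
feats = 1.0 - X.reshape(len(X), 8, 4, 32).mean(axis=(2, 3))

# One-layer logistic classifier trained by gradient descent.
w, b = np.zeros(8), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(feats @ w + b)))
    g = p - y
    w -= 0.5 * feats.T @ g / len(y)
    b -= 0.5 * g.mean()

acc = ((feats @ w + b > 0) == (y == 1)).mean()
print(f"training accuracy: {acc:.2f}")
```

Real systems on this page replace the hand-made band features with learned representations (CNNs, vision transformers, or multi-modal text+layout encoders), but the task framing — image in, document class out — is the same.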
These leaderboards are used to track progress in Document Image Classification.
Use these libraries to find Document Image Classification models and implementations.
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
It is found that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
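The distillation-token idea above can be sketched in a few lines: a learnable token is appended alongside the class token; after encoding, its output is supervised by the teacher's soft prediction while the class token is supervised by the ground-truth label. This is a hedged, NumPy-only illustration — the encoder is replaced by an identity stand-in, and all names and shapes are hypothetical, not the DeiT implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
D, C, N = 16, 4, 10          # embed dim, num classes, num patch tokens

# Learnable class and distillation tokens, appended to the patch tokens.
cls_tok  = rng.normal(size=(1, D))
dist_tok = rng.normal(size=(1, D))
patches  = rng.normal(size=(N, D))
seq = np.concatenate([cls_tok, dist_tok, patches])   # (N + 2, D)

# Stand-in for the transformer encoder (identity, for the sketch only;
# in the real model the tokens interact through self-attention).
encoded = seq

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W_cls  = rng.normal(size=(D, C))
W_dist = rng.normal(size=(D, C))
p_cls  = softmax(encoded[0] @ W_cls)    # class-token head: true label
p_dist = softmax(encoded[1] @ W_dist)   # distillation head: teacher output

label   = 2
teacher = softmax(rng.normal(size=C))   # stand-in for teacher's prediction

# Combined objective: hard-label cross-entropy + distillation cross-entropy.
loss = -np.log(p_cls[label]) - (teacher * np.log(p_dist)).sum()
print(f"combined loss: {loss:.3f}")
```

The key design point is that the two tokens receive different supervision signals yet share the same encoder, so the teacher's knowledge flows to the student through the attention layers rather than through a separate output head alone.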
The LayoutLM is proposed to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents.
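The joint text-and-layout modelling described above can be illustrated with a small sketch: each token's input vector is its word embedding plus embeddings of its bounding-box coordinates on the page. The tables, shapes, and function below are hypothetical stand-ins for illustration, not the LayoutLM code.

```python
import numpy as np

rng = np.random.default_rng(1)
V, D, MAX = 100, 16, 1000   # vocab size, hidden dim, max page coordinate

# Hypothetical embedding tables (randomly initialised for the sketch):
tok_emb = rng.normal(size=(V, D))
x_emb   = rng.normal(size=(MAX, D))   # shared for x0 / x1 coordinates
y_emb   = rng.normal(size=(MAX, D))   # shared for y0 / y1 coordinates

def embed(token_ids, boxes):
    """Combine token and 2-D layout embeddings, LayoutLM-style:
    each token's vector is its word embedding plus the embeddings
    of its bounding-box corners (x0, y0, x1, y1)."""
    t = tok_emb[token_ids]
    x0, y0, x1, y1 = boxes.T
    return t + x_emb[x0] + y_emb[y0] + x_emb[x1] + y_emb[y1]

ids = np.array([5, 17, 42])
boxes = np.array([[10, 20, 110, 40],   # (x0, y0, x1, y1) in page units
                  [120, 20, 200, 40],
                  [10, 60, 90, 80]])
out = embed(ids, boxes)
print(out.shape)   # one fused text+layout vector per token
```

Because layout enters as additive embeddings, the downstream transformer can attend over where words sit on the page as well as what they say, which is what makes the approach useful for tasks like information extraction from scanned documents.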
A self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers, is introduced, and results on image classification and semantic segmentation show that the model achieves competitive results with previous pre-training methods.
The LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUND dataset and aims to bridge the language barriers for visually-rich document understanding.
An exhaustive investigation of recent deep learning architectures, algorithms, and strategies for document image classification is presented, ultimately reducing the error by more than half.
The LayoutLMv2 architecture is pre-trained with new tasks to model the interaction among text, layout, and image in a single multi-modal framework, and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks.
A novel OCR-free VDU model named Donut, which stands for Document understanding transformer, achieves state-of-the-art performance on various VDU tasks in terms of both speed and accuracy, and offers a synthetic data generator that makes model pre-training flexible across languages and domains.
The proposed region-based Deep Convolutional Neural Network framework for document structure learning achieves state-of-the-art accuracy of 92.21% on the popular RVL-CDIP document image dataset, exceeding the benchmarks set by the existing algorithms.