Question answering on chart images
These leaderboards are used to track progress in Chart Question Answering.
Use these libraries to find Chart Question Answering models and implementations.
No subtasks available.
Pix2Struct, a pretrained image-to-text model for purely visual language understanding, is presented; it can be finetuned on tasks containing visually situated language, and it introduces a variable-resolution input representation together with a more flexible integration of language and vision inputs.
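As a concrete starting point, here is a minimal sketch of querying a chart with Pix2Struct through the Hugging Face transformers API; the checkpoint name, image path, and question are illustrative assumptions, not details from the summary above.

```python
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Assumed checkpoint: a Pix2Struct model finetuned for chart QA.
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-chartqa-base")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-chartqa-base")

image = Image.open("chart.png")  # hypothetical local chart image
# The processor renders the question into the variable-resolution visual input.
inputs = processor(images=image, text="What is the highest value?", return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(outputs[0], skip_special_tokens=True))
```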
A novel framework leverages Structured Triplet Representations (STR) to achieve a unified and label-efficient approach to chart perception and reasoning; it applies to a range of downstream tasks beyond the question-answering setting studied in prior work.
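To make the triplet idea tangible, here is a hypothetical sketch of how chart content might be held as structured triplets and queried; the field names and schema are assumptions for illustration, not the STR paper's actual representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChartTriplet:
    subject: str   # e.g. a bar's category label (hypothetical field)
    relation: str  # e.g. "has_value" (hypothetical field)
    value: str     # e.g. "42" (hypothetical field)

def answer(triplets: list[ChartTriplet], subject: str, relation: str) -> str | None:
    """Toy reasoning step: look up the value attached to a chart element."""
    for t in triplets:
        if t.subject == subject and t.relation == relation:
            return t.value
    return None

chart = [ChartTriplet("2021", "has_value", "42"),
         ChartTriplet("2022", "has_value", "57")]
print(answer(chart, "2022", "has_value"))  # -> 57
```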
PaLI-X, a multilingual vision and language model, advances the state of the art on most of the vision-and-language benchmarks considered and exhibits emergent capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
The Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images, sets new records for generalist models at similar model scales on a broad range of vision-centric benchmarks.
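A minimal sketch of asking Qwen-VL-Chat about a chart, following the published remote-code interface; since the chat() method is defined by the checkpoint's own code, its exact signature is an assumption here, as are the image path and question.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# from_list_format interleaves image references and text into one query.
query = tokenizer.from_list_format([
    {"image": "chart.png"},  # hypothetical local chart image
    {"text": "Which category has the largest share?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```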
This work introduces ScreenAI, a vision-language model specialized in UI and infographic understanding; it improves on the PaLI architecture with the flexible patching strategy of Pix2Struct and is trained on a unique mixture of datasets.
FigureQA is envisioned as a first step towards developing models that can intuitively recognize patterns from visual representations of data, and preliminary results indicate that the task poses a significant machine learning challenge.
DVQA is presented, a dataset that tests many aspects of bar-chart understanding in a question-answering framework, along with two strong baselines that perform considerably better than current VQA algorithms.
This work proposes a novel chart question answering (CQA) algorithm called Parallel Recurrent Fusion of Image and Language (PReFIL), which first learns bimodal embeddings by fusing question and image features and then intelligently aggregates these learned embeddings to answer the given question.
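To illustrate the fuse-then-aggregate idea, here is a simplified PyTorch sketch in the spirit of PReFIL: the question vector is tiled over two CNN feature maps, fused with 1x1 convolutions, and the fused features are aggregated by a recurrent network. All layer sizes, the two-branch setup, and the GRU aggregator are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Fuse a question vector with a CNN feature map via tiling + 1x1 convs
    (a simplified stand-in for PReFIL's fusion blocks)."""
    def __init__(self, img_ch: int, q_dim: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(img_ch + q_dim, out_ch, kernel_size=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=1), nn.ReLU(),
        )

    def forward(self, img_feat, q):
        b, _, h, w = img_feat.shape
        q_tiled = q[:, :, None, None].expand(b, q.size(1), h, w)
        return self.conv(torch.cat([img_feat, q_tiled], dim=1))

class PReFILSketch(nn.Module):
    def __init__(self, q_dim=512, low_ch=128, high_ch=256, fused_ch=256,
                 hidden=512, n_answers=1000):
        super().__init__()
        self.fuse_low = FusionBlock(low_ch, q_dim, fused_ch)
        self.fuse_high = FusionBlock(high_ch, q_dim, fused_ch)
        # Recurrent aggregation: fused features are read as a sequence.
        self.agg = nn.GRU(fused_ch, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_answers)

    def forward(self, f_low, f_high, q):
        seq = torch.cat([
            self.fuse_low(f_low, q).flatten(2).transpose(1, 2),
            self.fuse_high(f_high, q).flatten(2).transpose(1, 2),
        ], dim=1)               # (B, total spatial positions, fused_ch)
        _, h_n = self.agg(seq)  # final hidden states of both directions
        pooled = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.classifier(pooled)
```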
This work proposes a new model that jointly learns classification and regression for chart question answering, using co-attention transformers to capture the complex real-world interactions between the question and the chart's textual elements.
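The joint classification/regression design can be sketched as two output heads over a co-attended representation. The single attention block, pooling, gating head, and all dimensions below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class JointHeadSketch(nn.Module):
    """Illustrative only: question tokens attend over chart text elements,
    then a classification head (fixed answer vocabulary) and a regression
    head (numeric answers) share the pooled representation."""
    def __init__(self, dim=256, n_classes=100):
        super().__init__()
        self.q_to_chart = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.cls_head = nn.Linear(dim, n_classes)  # categorical answers
        self.reg_head = nn.Linear(dim, 1)          # numeric answers
        self.gate = nn.Linear(dim, 1)              # which head to trust per question

    def forward(self, q_tokens, chart_tokens):
        attended, _ = self.q_to_chart(q_tokens, chart_tokens, chart_tokens)
        pooled = attended.mean(dim=1)
        return (self.cls_head(pooled),
                self.reg_head(pooled),
                torch.sigmoid(self.gate(pooled)))
```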
This work presents two transformer-based models that combine visual features and the chart's data table in a unified way to answer questions, achieving state-of-the-art results on the previous datasets as well as on the new benchmark.
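A minimal sketch of the unified-input idea: the chart's extracted data table is flattened to text and fed alongside the question to a sequence-to-sequence model. The plain T5 checkpoint and the serialization format are assumptions for illustration; an un-finetuned model will not answer correctly, and the paper's models additionally inject visual features, omitted here.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Hypothetical extracted data table, flattened row by row into plain text.
table = [("year", "sales"), ("2021", "42"), ("2022", "57")]
flat = " | ".join(" , ".join(row) for row in table)
prompt = f"question: Which year had higher sales? table: {flat}"

ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))
```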