3260 papers • 126 benchmarks • 313 datasets
These leaderboards are used to track progress in visual question answering.
Use these libraries to find visual question answering models and implementations.
The proposed Grad-CAM technique uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map highlighting the regions of the image important for predicting the concept, and shows that even non-attention-based models learn to localize discriminative regions of the input image.
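The weighting step Grad-CAM describes can be sketched in a few lines of plain Python (a minimal illustration, not the authors' implementation; the `grad_cam` function and its list-based tensors are ours):

```python
def grad_cam(activations, gradients):
    """Coarse localization map from final-conv activations and the
    gradient of the target-class score w.r.t. those activations.
    Both inputs are [channels][height][width] nested lists."""
    k = len(activations)
    h, w = len(activations[0]), len(activations[0][0])
    # Channel weights: global-average-pool the gradients per channel.
    alphas = [sum(sum(row) for row in gradients[c]) / (h * w)
              for c in range(k)]
    # Weighted combination of activation channels, then ReLU.
    cam = [[max(0.0, sum(alphas[c] * activations[c][i][j] for c in range(k)))
            for j in range(w)] for i in range(h)]
    return cam
```

Channels whose gradients are negative on average are suppressed by the ReLU, which is what keeps the map focused on evidence *for* the target concept.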
A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.
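The top-down half of this mechanism amounts to scoring each bottom-up region feature against a query vector and normalizing with a softmax; a minimal sketch (function and variable names are illustrative, and real models score regions with learned projections rather than a raw dot product):

```python
import math

def top_down_attention(region_feats, query):
    """Soft attention over bottom-up region features.
    region_feats: list of feature vectors (one per detected region).
    query: top-down context vector (e.g. a question encoding).
    Returns (attention weights, attention-weighted average feature)."""
    # Score each region by its dot product with the query.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_feats]
    # Numerically stable softmax over region scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Attended feature: convex combination of region features.
    dim = len(region_feats[0])
    attended = [sum(w * feat[d] for w, feat in zip(weights, region_feats))
                for d in range(dim)]
    return weights, attended
```

Attending over object proposals rather than a uniform grid is the "bottom-up" part: the candidate regions come from a detector, and the question only has to choose among them.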
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ∼0.25M images, ∼0.76M questions, and ∼10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
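The consensus-based automatic evaluation the abstract refers to (an answer counts as fully correct when enough human annotators also gave it) can be sketched as follows (a simplified illustration; the actual VQA evaluation also normalizes answer strings and averages over annotator subsets):

```python
def vqa_accuracy(predicted, human_answers):
    """Consensus accuracy for one question: the predicted answer is
    scored min(#matching human answers / 3, 1), so agreeing with at
    least 3 of the ~10 annotators counts as 100% correct."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)
```

This is what makes open-ended VQA automatically gradable: most answers are short, so exact matching against the annotator pool is meaningful.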
BLIP-2 achieves state-of-the-art performance on various vision-language tasks despite having significantly fewer trainable parameters than existing methods, and demonstrates emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
This work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.
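An RN module can be sketched as a relation function g applied to every ordered pair of object representations, summed, and passed through f (an illustrative pure-Python skeleton; in practice g and f are small MLPs and the objects are learned feature vectors):

```python
def relation_network(objects, g, f):
    """RN module: f(sum over all ordered pairs (o_i, o_j) of g(o_i, o_j)).
    objects: list of object representations (vectors as lists).
    g: relation function taking two objects, returning a vector.
    f: readout function applied to the summed relation vectors."""
    pair_sum = None
    for o_i in objects:
        for o_j in objects:
            r = g(o_i, o_j)
            # Accumulate the relation vectors elementwise.
            pair_sum = r if pair_sum is None else [a + b for a, b in zip(pair_sum, r)]
    return f(pair_sum)
```

Because the sum ranges over all pairs, the module is permutation-invariant in the objects, which is what lets it reason about relations without being told which entities are related.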
This paper presents LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding, and introduces GPT-4-generated visual instruction-tuning data, with the model and code base made publicly available.
This model, while being architecturally simple and relatively small in terms of trainable parameters, sets a new state of the art on both the unbalanced and balanced VQA benchmarks.
ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language, is presented, extending the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
In this paper, we explain the mechanism of bilinear pooling as a module of hard sample generation, and find that bilinear pooling significantly expands the variances of the first-order vectors when it produces discriminative bilinear features. In conjunction with the extremely high dimensionality of the obtained bilinear features, those variances lead to overfitting in subsequent learning models. To solve this issue, we construct a bi-level optimization problem, where the high-level problem is the supervised classification loss and the low-level problem is principal component analysis (PCA). We then find that PCA on bilinear features is equivalent to spectral clustering, which allows us to mathematically prove that the first $\log_2(C)$ principal components can support the discriminant information of $C$ classes. By removing the remaining principal components, the dimensionality and the variances are reduced simultaneously. To the best of our knowledge, this is the first work providing a lower bound for dimension reduction for bilinear pooling. However, the PCA projection matrix $\mathbf{L}$ is prone to overfitting due to its many parameters.
To address this issue, we propose a rank-$k$ general bilinear projection (RK-GBP) that decomposes $\mathbf{L}$ into two small matrices $\mathbf{U}$ and $\mathbf{V}$ with fewer learnable parameters. Unlike the traditional bilinear projections used in factorized bilinear pooling (FBiP), our RK-GBP preserves the orthogonality of the columns of $\mathbf{L}$ by constraining the orthogonality of the columns of $\mathbf{U}$ and $\mathbf{V}$.
For computational efficiency, we relax the PCA in the low-level task into a dictionary learning problem, obtaining rank-$k$ orthogonal factorization bilinear pooling (RK-OFBP). RK-OFBP can be considered a general form of current factorized bilinear pooling methods (e.g., Hadamard product-based ones). Finally, we evaluate our approach on fine-grained images and large-scale datasets, demonstrating that our method not only produces extremely low-dimensional features but also outperforms other methods in classification tasks. For example, RK-OFBP can employ 32-dimensional vectors to achieve results comparable to B-CNN (Lin, 2015) (dimension: 512×512) on the 200-class classification task.
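Plain bilinear pooling, and the paper's $\log_2(C)$ lower bound on how many principal components to keep, can be sketched as follows (illustrative only; the RK-OFBP factorization and the dictionary-learning relaxation are not reproduced here, and rounding the bound up to an integer is our reading):

```python
import math

def bilinear_pool(x, y):
    """Plain bilinear feature: flattened outer product of two feature
    vectors. For d-dimensional inputs this gives a d*d-dimensional
    feature, which is the dimensionality blow-up the paper addresses."""
    return [xi * yj for xi in x for yj in y]

def min_components(num_classes):
    """The paper's lower bound: the first log2(C) principal components
    can carry the discriminant information of C classes (rounded up
    here to get an integer component count)."""
    return math.ceil(math.log2(num_classes))
```

For the 200-class case cited in the abstract this bound is only 8 components, which makes the reported 32-dimensional RK-OFBP features plausible against the 512×512-dimensional B-CNN baseline.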
The new DMN+ model improves the state of the art on both the Visual Question Answering dataset and the bAbI-10k text question-answering dataset without supporting fact supervision.