3260 papers • 126 benchmarks • 313 datasets
These leaderboards are used to track progress in visual question answering.
Use these libraries to find visual question answering models and implementations.
The proposed Grad-CAM technique uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map highlighting the regions of the image important for predicting the concept, and shows that even non-attention-based models learn to localize discriminative regions of the input image.
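The weighting step Grad-CAM describes can be sketched in a few lines of plain Python (a minimal illustration, not the authors' implementation; the `grad_cam` function and its list-based tensors are ours):

```python
def grad_cam(activations, gradients):
    """Coarse localization map from final-conv activations and the
    gradient of the target-class score w.r.t. those activations.
    Both inputs are [channels][height][width] nested lists."""
    k = len(activations)
    h, w = len(activations[0]), len(activations[0][0])
    # Channel weights: global-average-pool the gradients per channel.
    alphas = [sum(sum(row) for row in gradients[c]) / (h * w)
              for c in range(k)]
    # Weighted combination of activation channels, then ReLU.
    cam = [[max(0.0, sum(alphas[c] * activations[c][i][j] for c in range(k)))
            for j in range(w)] for i in range(h)]
    return cam
```

Channels whose gradients are negative on average are suppressed by the ReLU, which is what keeps the map focused on evidence *for* the target concept.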
A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.
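The top-down half of this mechanism amounts to scoring each bottom-up region feature against a query vector and normalizing with a softmax; a minimal sketch (function and variable names are illustrative, and real models score regions with learned projections rather than a raw dot product):

```python
import math

def top_down_attention(region_feats, query):
    """Soft attention over bottom-up region features.
    region_feats: list of feature vectors (one per detected region).
    query: top-down context vector (e.g. a question encoding).
    Returns (attention weights, attention-weighted average feature)."""
    # Score each region by its dot product with the query.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_feats]
    # Numerically stable softmax over region scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Attended feature: convex combination of region features.
    dim = len(region_feats[0])
    attended = [sum(w * feat[d] for w, feat in zip(weights, region_feats))
                for d in range(dim)]
    return weights, attended
```

Attending over object proposals rather than a uniform grid is the "bottom-up" part: the candidate regions come from a detector, and the question only has to choose among them.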
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ∼0.25M images, ∼0.76M questions, and ∼10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
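The consensus-based automatic evaluation the abstract refers to (an answer counts as fully correct when enough human annotators also gave it) can be sketched as follows (a simplified illustration; the actual VQA evaluation also normalizes answer strings and averages over annotator subsets):

```python
def vqa_accuracy(predicted, human_answers):
    """Consensus accuracy for one question: the predicted answer is
    scored min(#matching human answers / 3, 1), so agreeing with at
    least 3 of the ~10 annotators counts as 100% correct."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)
```

This is what makes open-ended VQA automatically gradable: most answers are short, so exact matching against the annotator pool is meaningful.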
BLIP-2 achieves state-of-the-art performance on various vision-language tasks despite having significantly fewer trainable parameters than existing methods, and demonstrates emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
This work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.
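An RN module can be sketched as a relation function g applied to every ordered pair of object representations, summed, and passed through f (an illustrative pure-Python skeleton; in practice g and f are small MLPs and the objects are learned feature vectors):

```python
def relation_network(objects, g, f):
    """RN module: f(sum over all ordered pairs (o_i, o_j) of g(o_i, o_j)).
    objects: list of object representations (vectors as lists).
    g: relation function taking two objects, returning a vector.
    f: readout function applied to the summed relation vectors."""
    pair_sum = None
    for o_i in objects:
        for o_j in objects:
            r = g(o_i, o_j)
            # Accumulate the relation vectors elementwise.
            pair_sum = r if pair_sum is None else [a + b for a, b in zip(pair_sum, r)]
    return f(pair_sum)
```

Because the sum ranges over all pairs, the module is permutation-invariant in the objects, which is what lets it reason about relations without being told which entities are related.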
This paper presents LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding, and introduces GPT-4-generated visual instruction-tuning data, with the model and code base made publicly available.
This model, while being architecturally simple and relatively small in terms of trainable parameters, sets a new state of the art on both the unbalanced and balanced VQA benchmarks.
ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language, is presented, extending the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
In this paper, we explain the mechanism of bilinear pooling as a module of hard sample generation, and find that bilinear pooling significantly expands the variances of the first-order vectors when it produces discriminative bilinear features. In conjunction with the extremely high dimensionality of the obtained bilinear features, those variances lead to overfitting in subsequent learning models. To solve this issue, we construct a bi-level optimization problem, where the high-level problem is the supervised classification loss and the low-level problem is principal component analysis (PCA). We then find that PCA on bilinear features is equivalent to spectral clustering, which allows us to mathematically prove that the first $\log_2(C)$ principal components can support the discriminant information of $C$ classes. By removing the remaining principal components, the dimensionality and the variances are reduced simultaneously. To the best of our knowledge, this is the first work providing a lower bound for dimension reduction for bilinear pooling. However, the PCA projection matrix $\mathbf{L}$ is prone to overfitting due to its many parameters.
To address this issue, we propose a rank-$k$ general bilinear projection (RK-GBP) that decomposes $\mathbf{L}$ into two small matrices $\mathbf{U}$ and $\mathbf{V}$ with fewer learnable parameters. Unlike the traditional bilinear projections used in factorized bilinear pooling (FBiP), our RK-GBP preserves the orthogonality of the columns of $\mathbf{L}$ by constraining the orthogonality of the columns of $\mathbf{U}$ and $\mathbf{V}$.
For computational efficiency, we relax the PCA in the low-level task into a dictionary learning problem, obtaining rank-$k$ orthogonal factorization bilinear pooling (RK-OFBP). RK-OFBP can be considered a general form of current factorized bilinear pooling methods (e.g., Hadamard product-based ones). Finally, we evaluate our approach on fine-grained images and large-scale datasets, demonstrating that our method not only produces extremely low-dimensional features but also outperforms other methods in classification tasks. For example, RK-OFBP can employ 32-dimensional vectors to achieve results comparable to B-CNN (Lin, 2015) (dimension: 512×512) on the 200-class classification task.
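Plain bilinear pooling, and the paper's $\log_2(C)$ lower bound on how many principal components to keep, can be sketched as follows (illustrative only; the RK-OFBP factorization and the dictionary-learning relaxation are not reproduced here, and rounding the bound up to an integer is our reading):

```python
import math

def bilinear_pool(x, y):
    """Plain bilinear feature: flattened outer product of two feature
    vectors. For d-dimensional inputs this gives a d*d-dimensional
    feature, which is the dimensionality blow-up the paper addresses."""
    return [xi * yj for xi in x for yj in y]

def min_components(num_classes):
    """The paper's lower bound: the first log2(C) principal components
    can carry the discriminant information of C classes (rounded up
    here to get an integer component count)."""
    return math.ceil(math.log2(num_classes))
```

For the 200-class case cited in the abstract this bound is only 8 components, which makes the reported 32-dimensional RK-OFBP features plausible against the 512×512-dimensional B-CNN baseline.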
The new DMN+ model improves the state of the art on both the Visual Question Answering dataset and the bAbI-10k text question-answering dataset without supporting fact supervision.