3260 papers • 126 benchmarks • 313 datasets
Generating free-form answers to questions posed about images.
These leaderboards are used to track progress in Generative Visual Question Answering.
Use these libraries to find Generative Visual Question Answering models and implementations.
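To make the free-form nature of the task concrete, here is a minimal sketch (independent of any particular library; the class and method names are hypothetical) contrasting the generative setting with classification-style VQA, where answers are scored against a fixed vocabulary:

```python
from typing import Protocol

class ClassificationVQA(Protocol):
    """Discriminative VQA: score every candidate answer from a fixed vocabulary."""
    def score(self, image_path: str, question: str, answers: list[str]) -> list[float]: ...

class GenerativeVQA(Protocol):
    """Generative VQA: decode an open-ended answer string token by token."""
    def generate(self, image_path: str, question: str, max_new_tokens: int = 32) -> str: ...
```

The papers below all follow the second interface: the answer is produced by a language decoder rather than selected from a closed answer set.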
This work introduces Flamingo, a family of Visual Language Models (VLM) with the ability to bridge powerful pretrained vision-only and language-only models, handle sequences of arbitrarily interleaved visual and textual data, and seamlessly ingest images or videos as inputs.
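The bridging mechanism Flamingo describes is commonly sketched as tanh-gated cross-attention from the frozen language model's hidden states to visual tokens. The PyTorch layer below is a simplified illustration of that idea; the dimensions and module layout are assumptions, not the published implementation:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Flamingo-style block: text states attend to visual tokens, with learnable
    tanh gates initialised at zero so the pretrained LM is unchanged at step 0."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.attn_gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> identity at init
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_states: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(text_states, visual_tokens, visual_tokens)
        x = text_states + torch.tanh(self.attn_gate) * attended
        return x + torch.tanh(self.ff_gate) * self.ff(x)

# Toy shapes: 2 sequences of 16 text tokens attending to 64 visual tokens.
block = GatedCrossAttentionBlock()
out = block(torch.randn(2, 16, 512), torch.randn(2, 64, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```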
BLIP-2 achieves state-of-the-art performance on various vision-language tasks despite having significantly fewer trainable parameters than existing methods, and demonstrates emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
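For reference, the released BLIP-2 checkpoints can be queried for free-form VQA through the Hugging Face transformers API roughly as follows (a minimal sketch; the checkpoint, image URL, and prompt format follow the library's documented example):

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

name = "Salesforce/blip2-opt-2.7b"  # other BLIP-2 variants expose the same interface
processor = Blip2Processor.from_pretrained(name)
model = Blip2ForConditionalGeneration.from_pretrained(name)

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# Free-form VQA: condition generation on the image and a natural-language question.
prompt = "Question: how many cats are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
```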
This study reframes the problem of MedVQA as a generation task that naturally follows human-machine interaction, and proposes a generative model for medical visual understanding that aligns visual information from a pre-trained vision encoder with a large language model.
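The alignment step described here typically amounts to projecting frozen vision-encoder features into the language model's embedding space and prepending them to the question tokens. The sketch below illustrates that idea with a single linear mapping; the dimensions, module names, and frozen/trainable split are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    """Project patch features from a frozen vision encoder into the LLM embedding
    space so they can be consumed as a prefix of visual tokens."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_features)  # (batch, n_patches, llm_dim)

adapter = VisionToLLMAdapter()
patches = torch.randn(1, 257, 1024)          # e.g. ViT-L/14 CLS + patch tokens
question_embeds = torch.randn(1, 12, 4096)   # token embeddings of the question
# The LLM decodes the answer from [visual prefix ; question tokens].
llm_inputs = torch.cat([adapter(patches), question_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 269, 4096])
```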
A novel generative model enhanced by multi-modal prompt retrieval (MPR) integrates retrieved prompts and multimodal features to generate answers in free text, enabling rapid zero-shot adaptation to unseen data distributions and open-set answer labels across datasets.
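A hedged sketch of the retrieval idea: embed the test question (and/or image), look up the nearest training QA pairs, and splice them into the prompt the generative model decodes from. Function and variable names below are hypothetical, and the actual MPR model fuses retrieved prompts with multimodal features rather than plain text concatenation:

```python
import numpy as np

def retrieve_prompts(query_emb: np.ndarray, bank_embs: np.ndarray,
                     bank_qa: list[tuple[str, str]], k: int = 3) -> list[tuple[str, str]]:
    """Return the k most similar (question, answer) pairs by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    top = np.argsort(-(b @ q))[:k]
    return [bank_qa[i] for i in top]

def build_prompt(retrieved: list[tuple[str, str]], question: str) -> str:
    """Prepend retrieved exemplars to the test question for the generator."""
    shots = "\n".join(f"Question: {q} Answer: {a}" for q, a in retrieved)
    return f"{shots}\nQuestion: {question} Answer:"

# Toy example with random embeddings standing in for the multimodal encoders.
rng = np.random.default_rng(0)
bank = rng.normal(size=(100, 64))
qa = [(f"q{i}", f"a{i}") for i in range(100)]
print(build_prompt(retrieve_prompts(rng.normal(size=64), bank, qa), "What modality is shown?"))
```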