3260 papers • 126 benchmarks • 313 datasets
Visual Entailment (VE) is a task consisting of image-sentence pairs in which the premise is defined by an image, rather than by a natural language sentence as in traditional Textual Entailment. The goal is to predict whether the image semantically entails the text.
(Image credit: Papersgraph)
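Concretely, a VE system scores an (image premise, text hypothesis) pair with one of three labels: entailment, neutral, or contradiction in the SNLI-VE formulation. The sketch below is only illustrative: it borrows CLIP as an off-the-shelf image/text encoder and adds an untrained pairwise classification head; the fusion and head are assumptions for exposition, not any leaderboard model's method.

```python
# Minimal sketch of visual entailment as three-way classification.
# CLIP is used only as a convenient off-the-shelf encoder; the fusion
# features and MLP head below are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

encoder = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class EntailmentHead(nn.Module):
    """Scores an (image premise, text hypothesis) pair with 3 labels."""
    def __init__(self, dim=512, num_labels=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * dim, dim), nn.ReLU(), nn.Linear(dim, num_labels)
        )

    def forward(self, img, txt):
        # Fuse the two embeddings with common pairwise features.
        fused = torch.cat([img, txt, img * txt, (img - txt).abs()], dim=-1)
        return self.mlp(fused)

head = EntailmentHead()  # untrained; shown only to make the task format concrete

def predict(image, hypothesis):
    inputs = processor(text=[hypothesis], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = encoder(**inputs)
    logits = head(out.image_embeds, out.text_embeds)
    return ["entailment", "neutral", "contradiction"][logits.argmax(-1).item()]
```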
These leaderboards are used to track progress in Visual Entailment
Use these libraries to find Visual Entailment models and implementations
No subtasks available.
UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
Contrastive Captioner (CoCa) is introduced, a minimalist design that pretrains an image-text encoder-decoder foundation model jointly with a contrastive loss and a captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. A schematic of such a joint objective is sketched below.
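The core idea in the CoCa summary is the combination of a contrastive objective with a captioning objective. The sketch below shows one way to write such a joint loss; the temperature, loss weighting, and tensor shapes are assumptions for illustration, not CoCa's actual implementation.

```python
# Schematic joint contrastive + captioning loss in the spirit of CoCa.
# The weighting and temperature values are illustrative assumptions.
import torch
import torch.nn.functional as F

def joint_loss(image_feats, text_feats, caption_logits, caption_targets,
               temperature=0.07, caption_weight=2.0):
    # Contrastive (InfoNCE) term over matched image/text pairs in the batch.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2

    # Captioning term: next-token cross-entropy from the text decoder,
    # with caption_logits of shape (batch, seq_len, vocab).
    captioning = F.cross_entropy(
        caption_logits.flatten(0, 1), caption_targets.flatten(),
        ignore_index=-100)

    return contrastive + caption_weight * captioning
```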
OFA is proposed, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness and achieves new SOTAs on a series of cross-modal tasks while attaining highly competitive performance on uni-modal tasks.
It is shown that CLIP significantly outperforms widely used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown, and achieves competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
This paper proposes SOHO ("Seeing Out of tHe bOx"), which takes a whole image as input and learns vision-language representations in an end-to-end manner; it does not require bounding box annotations, which enables inference 10 times faster than region-based approaches.
To enable large-scale training, VILLA adopts the "free" adversarial training strategy, and combines it with KL-divergence-based regularization to promote higher invariance in the embedding space.
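The sketch below illustrates the general idea described above: perturb the multimodal embeddings adversarially and add a KL-divergence consistency term between clean and perturbed predictions. It collapses the "free" multi-step gradient reuse into a single perturbation step, so it is a simplification rather than VILLA's algorithm; the `model`, `epsilon`, and `alpha` names are assumptions.

```python
# Simplified sketch of adversarial embedding perturbation with a
# KL-divergence consistency term, in the spirit of VILLA. Real "free"
# adversarial training recycles gradients across inner steps; this single
# perturbation step only illustrates the regularizer.
import torch
import torch.nn.functional as F

def villa_style_loss(model, embeds, labels, epsilon=1e-3, alpha=1.0):
    # Clean forward pass and task loss. `model` is assumed to map
    # multimodal embeddings to classification logits.
    clean_logits = model(embeds)
    task_loss = F.cross_entropy(clean_logits, labels)

    # Adversarial perturbation in embedding space (one ascent step).
    delta = torch.zeros_like(embeds, requires_grad=True)
    adv_loss = F.cross_entropy(model(embeds + delta), labels)
    grad, = torch.autograd.grad(adv_loss, delta)
    delta = epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)

    # KL term pushes perturbed predictions toward the clean ones,
    # promoting invariance in the embedding space.
    perturbed_logits = model(embeds + delta.detach())
    kl = F.kl_div(F.log_softmax(perturbed_logits, dim=-1),
                  F.softmax(clean_logits.detach(), dim=-1),
                  reduction="batchmean")
    return task_loss + alpha * kl
```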
A new task of Chart Caption Factual Error Correction is established; CHARTVE, a visual entailment model that outperforms proprietary and open-source LVLMs at evaluating factual consistency, is introduced; and an interpretable two-stage framework that excels at correcting factual errors is proposed.
DiDE is introduced, a framework that distills the knowledge of a fusion-encoder teacher model into a dual-encoding student model and encourages the student not only to mimic the teacher's predictions but also to compute cross-modal attention distributions and align them with the teacher's.
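A rough sketch of such a two-part distillation objective, prediction mimicking plus cross-modal attention alignment, is given below. The tensor shapes, temperature, and the choice of KL for both terms are illustrative assumptions rather than DiDE's exact losses.

```python
# Sketch of distilling a fusion-encoder teacher into a dual-encoder student:
# the student matches the teacher's predictions and its cross-modal attention
# distribution. Shapes and loss choices are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_attn, teacher_attn, tau=2.0, beta=1.0):
    # Prediction mimicking: soft-label KL between student and teacher logits.
    pred_kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean") * tau ** 2

    # Attention alignment: the student's cross-modal attention distribution
    # (e.g. text queries over image tokens, already softmax-normalized)
    # should match the teacher's.
    attn_kl = F.kl_div(
        torch.log(student_attn + 1e-8),
        teacher_attn,
        reduction="batchmean")
    return pred_kl + beta * attn_kl
```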
EVE and several other state-of-the-art visual question answering (VQA) models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights into how modern VQA-based models perform.