Visual Entailment (VE) is a task consisting of image-sentence pairs, where the premise is defined by an image rather than a natural language sentence as in traditional Textual Entailment. The goal is to predict whether the image semantically entails the text.
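Below is a minimal sketch of how such an image-premise / text-hypothesis pair could be scored, assuming the three-way label set used by SNLI-VE (entailment, neutral, contradiction). It uses CLIP encoders from the `transformers` library as a simple dual-encoder baseline; the classifier head is untrained and illustrative only, and the file path and hypothesis are placeholders.

```python
# Minimal sketch of a dual-encoder baseline for Visual Entailment.
# The linear head below would need to be trained on an entailment dataset
# such as SNLI-VE; it is shown here only to illustrate the task interface.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

LABELS = ["entailment", "neutral", "contradiction"]

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
classifier = nn.Linear(2 * clip.config.projection_dim, len(LABELS))  # untrained placeholder head

def predict(image_path: str, hypothesis: str) -> str:
    """Encode the image premise and the text hypothesis, then classify the pair."""
    image = Image.open(image_path)
    inputs = processor(text=[hypothesis], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feat = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = clip.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        logits = classifier(torch.cat([img_feat, txt_feat], dim=-1))
    return LABELS[logits.argmax(dim=-1).item()]

# Example usage (placeholder inputs):
# predict("beach.jpg", "Two people are playing volleyball on the sand.")
```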
These leaderboards are used to track progress in Visual Entailment.
Use these libraries to find Visual Entailment models and implementations.
UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
Contrastive Captioner (CoCa) is a minimalist design that pretrains an image-text encoder-decoder foundation model jointly with a contrastive loss and a captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM.
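As a rough illustration of combining the two objectives described above, the sketch below sums a batch-level InfoNCE contrastive loss with a token-level captioning cross-entropy. The function signature, tensor shapes, temperature, and weighting are assumptions for exposition, not the CoCa implementation.

```python
# Hedged sketch: joint contrastive + captioning objective in the spirit of CoCa.
import torch
import torch.nn.functional as F

def joint_loss(image_emb, text_emb, caption_logits, caption_targets,
               temperature=0.07, caption_weight=1.0):
    """image_emb, text_emb: (B, D) pooled embeddings; caption_logits: (B, T, V);
    caption_targets: (B, T) token ids for the autoregressive captioning head."""
    # Contrastive term: image-to-text and text-to-image InfoNCE over the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = image_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(sim.size(0), device=sim.device)
    contrastive = (F.cross_entropy(sim, targets) +
                   F.cross_entropy(sim.t(), targets)) / 2
    # Captioning term: standard token-level cross-entropy from the decoder.
    captioning = F.cross_entropy(caption_logits.flatten(0, 1),
                                 caption_targets.flatten())
    return contrastive + caption_weight * captioning
```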
It is shown that CLIP significantly outperforms widely used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown, and achieves competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
OFA, a task-agnostic and modality-agnostic framework that supports task comprehensiveness, is proposed; it achieves new SOTAs in a series of cross-modal tasks while attaining highly competitive performance on unimodal tasks.
To enable large-scale training, VILLA adopts the "free" adversarial training strategy, and combines it with KL-divergence-based regularization to promote higher invariance in the embedding space.
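The sketch below illustrates the general idea of perturbing embeddings adversarially and adding a KL-divergence consistency term between clean and perturbed predictions. The model interface, single-step perturbation, and hyperparameters are assumptions for exposition and do not reproduce VILLA's actual "free" training recipe.

```python
# Hedged sketch: adversarial perturbation in embedding space with a KL
# consistency term, loosely following the idea described for VILLA.
import torch
import torch.nn.functional as F

def adversarial_kl_loss(model, image_emb, text_emb, labels, epsilon=1e-3, alpha=1.0):
    """`model(image_emb, text_emb)` is assumed to return classification logits."""
    clean_logits = model(image_emb, text_emb)
    task_loss = F.cross_entropy(clean_logits, labels)

    # One-step perturbation of the text embeddings along the loss gradient
    # (the real "free" strategy reuses gradients across multiple steps).
    delta = torch.zeros_like(text_emb, requires_grad=True)
    adv_logits = model(image_emb, text_emb + delta)
    grad = torch.autograd.grad(F.cross_entropy(adv_logits, labels), delta)[0]
    delta = epsilon * grad.sign()

    # KL term encourages clean and perturbed predictions to agree,
    # promoting invariance of the embedding space to small perturbations.
    perturbed_logits = model(image_emb, text_emb + delta)
    kl = F.kl_div(F.log_softmax(perturbed_logits, dim=-1),
                  F.softmax(clean_logits.detach(), dim=-1),
                  reduction="batchmean")
    return task_loss + alpha * kl
```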
DiDE is a framework that distills the knowledge of a fusion-encoder teacher model into a dual-encoder student model, encouraging the student not only to mimic the teacher's predictions but also to compute cross-modal attention distributions that align with the teacher's.
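A minimal way to express those two distillation signals is shown below: a softened KL term on the logits plus a KL term aligning attention distributions. Tensor shapes, the temperature, and the weighting are assumptions for illustration, not the paper's objective.

```python
# Hedged sketch: knowledge distillation with prediction mimicking and
# cross-modal attention alignment, in the spirit of the DiDE description.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_attn, teacher_attn,
                      temperature=2.0, attn_weight=1.0):
    """logits: (B, C); attn: (B, heads, Q, K) cross-modal attention maps,
    already softmax-normalized over the key dimension."""
    # Student mimics the teacher's softened class predictions.
    pred_kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                       F.softmax(teacher_logits / temperature, dim=-1),
                       reduction="batchmean") * temperature ** 2
    # Student's cross-modal attention aligns with the teacher's.
    attn_kl = F.kl_div(student_attn.clamp_min(1e-8).log(),
                       teacher_attn,
                       reduction="batchmean")
    return pred_kl + attn_weight * attn_kl
```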
This paper proposes SOHO ("Seeing Out of tHe bOx"), which takes a whole image as input and learns vision-language representations in an end-to-end manner; it requires no bounding box annotations, enabling inference 10 times faster than region-based approaches.
A new task of Chart Caption Factual Error Correction is established; CHARTVE, a visual entailment model that outperforms proprietary and open-source LVLMs in evaluating factual consistency, is introduced; and an interpretable two-stage framework that excels at correcting factual errors is proposed.
EVE and several other state-of-the-art visual question answering (VQA)-based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights into how modern VQA-based models perform.