Grounded language learning is the task of acquiring the meaning of language in situated environments.
ALBEF introduces a contrastive loss to ALign the image and text representations BEfore Fusing them through cross-modal attention, which enables more grounded vision-and-language representation learning, and proposes momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model.
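As a minimal sketch of the align-before-fuse idea, a symmetric image-text contrastive (InfoNCE) loss can be written as below; the function name, tensor shapes, and temperature are illustrative assumptions, not ALBEF's released implementation.

```python
# Minimal sketch of a symmetric image-text contrastive (InfoNCE) loss, in the
# spirit of aligning image and text representations before fusion.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, dim) embeddings; row i of each is a matched pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature     # (batch, batch) similarities
    targets = torch.arange(logits.size(0))           # positives lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```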
This work considers two pragmatic modeling methods for text generation: one where pragmatics is imposed by information preservation, and another where pragmatics is imposed by explicit modeling of distractors.
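As a rough illustration of the distractor-based variant, candidates can be reranked to favor text that is likely under the target context but unlikely under the distractors, in the spirit of rational-speech-acts reranking; `score_fn` here is a hypothetical log p(text | context) and `alpha` an assumed weight.

```python
# Sketch of pragmatic reranking with explicit distractors: prefer candidates
# that are informative about the target context relative to the distractors.
import math

def pragmatic_rerank(candidates, target_ctx, distractor_ctxs, score_fn, alpha=1.0):
    def pragmatic_score(text):
        target = score_fn(text, target_ctx)
        confusable = math.log(sum(math.exp(score_fn(text, c)) for c in distractor_ctxs))
        return target - alpha * confusable   # high when the text singles out the target
    return max(candidates, key=pragmatic_score)
```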
VisCOLL is a visually grounded language learning task that simulates the continual acquisition of compositional phrases from streaming visual scenes; experiments reveal that state-of-the-art continual learning approaches provide little to no improvement on VisCOLL.
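For concreteness, one standard continual-learning baseline of the kind evaluated in such streaming settings is experience replay with reservoir sampling; `model`, `update_fn`, and the stream of (scene, phrase) pairs below are placeholders, not the benchmark's actual API.

```python
# Sketch of a streaming training loop with reservoir-sampled experience replay,
# a common continual-learning baseline for non-i.i.d. data streams.
import random

def train_on_stream(model, stream, update_fn, buffer_size=1000, replay_k=4):
    buffer, seen = [], 0
    for scene, phrase in stream:               # non-i.i.d. stream of visual scenes
        replayed = random.sample(buffer, min(replay_k, len(buffer)))
        update_fn(model, [(scene, phrase)] + replayed)   # current item + replay
        seen += 1
        if len(buffer) < buffer_size:
            buffer.append((scene, phrase))
        else:
            j = random.randrange(seen)         # reservoir sampling keeps a uniform
            if j < buffer_size:                # sample of the whole stream so far
                buffer[j] = (scene, phrase)
```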
The results demonstrate that deep neural networks can exploit meta-learning, episodic memory and an explicitly multi-modal environment to account for 'fast-mapping', a fundamental pillar of human cognitive development and a potentially transformative capacity for agents that interact with human users.
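A toy illustration of the episodic-memory ingredient of fast mapping: a single exposure to a (word, object-embedding) pair is written once and recalled by nearest neighbour at query time. This is a didactic sketch, not the paper's agent architecture.

```python
# One-shot word-object binding via an episodic key-value memory.
import numpy as np

class EpisodicMemory:
    def __init__(self):
        self.keys, self.values = [], []        # object embeddings and their labels

    def write(self, embedding, word):          # one-shot storage: a single exposure
        self.keys.append(np.asarray(embedding, dtype=float))
        self.values.append(word)

    def read(self, embedding):                 # recall the closest stored episode
        q = np.asarray(embedding, dtype=float)
        sims = [k @ q / (np.linalg.norm(k) * np.linalg.norm(q)) for k in self.keys]
        return self.values[int(np.argmax(sims))]
```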
An agent is presented that learns to interpret language in a simulated 3D environment, where it is rewarded for the successful execution of written instructions. Its comprehension of language extends beyond its prior experience, enabling it to apply familiar language to unfamiliar situations and to interpret entirely novel instructions.
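The training signal in such setups is typically sparse: reward arrives only when the written instruction is satisfied. A minimal episode loop might look as follows; `env` and `agent` and their methods are hypothetical placeholders, not the paper's architecture.

```python
# Minimal sketch of reward-driven instruction following in a simulated world.
def run_episode(env, agent, instruction, max_steps=100):
    obs = env.reset(instruction)
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(obs, instruction)   # policy conditioned on the text
        obs, reward, done = env.step(action)   # positive reward iff instruction executed
        total_reward += reward
        if done:
            break
    return total_reward
```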
A joint imitation and reinforcement approach for grounded language learning through an interactive conversational game is proposed; the trained agent is able to actively acquire information by asking questions about novel objects and to use the just-learned knowledge in subsequent conversations in a one-shot fashion.
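A hedged sketch of what a joint imitation-plus-reinforcement objective can look like: a supervised cross-entropy term on teacher demonstrations combined with a REINFORCE term on sampled conversational actions. The tensor shapes and the mixing weight are assumptions, not the paper's exact formulation.

```python
# Combined imitation (supervised) and reinforcement (REINFORCE) objective.
import torch
import torch.nn.functional as F

def joint_loss(logits_demo, teacher_actions, logp_sampled, returns, mix=0.5):
    """logits_demo: (T, num_actions); teacher_actions: (T,) action indices;
    logp_sampled: (T,) log-probs of sampled actions; returns: (T,) rewards-to-go."""
    imitation = F.cross_entropy(logits_demo, teacher_actions)  # match the teacher
    reinforce = -(logp_sampled * returns).mean()               # reinforce rewarded turns
    return mix * imitation + (1 - mix) * reinforce
```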
A model is presented that learns to follow new instructions given prior instruction-perception-action examples; it achieves the best results to date on the SAIL dataset by using an improved perceptual component that can represent the relative positions of objects.
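As a toy sketch of what a relative-position perceptual feature can be, each object can be encoded by its offset from every other object, so that "the chair left of the lamp" is distinguishable from the reverse. The dict field names are assumptions for illustration.

```python
# Pairwise relative-position features over a set of perceived objects.
def relative_position_features(objects):
    """objects: list of dicts with 'name', 'x', 'y' keys."""
    features = []
    for a in objects:
        for b in objects:
            if a is not b:
                features.append((a["name"], b["name"], a["x"] - b["x"], a["y"] - b["y"]))
    return features
```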
It is demonstrated that a multilingual model can be trained equally well on either translations or comparable sentence pairs, and that annotating the same set of images in multiple languages enables further improvements via an additional caption-caption ranking objective.
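A caption-caption ranking objective can be sketched as follows: captions of the same image (e.g. in different languages) should be closer to each other than to captions of other images. A margin-based hinge loss is one standard choice; the margin value and single-direction form are assumptions.

```python
# Margin-based ranking loss between paired caption embeddings.
import torch
import torch.nn.functional as F

def caption_caption_ranking_loss(cap_a, cap_b, margin=0.2):
    """cap_a, cap_b: (batch, dim) caption embeddings; row i of each matches."""
    cap_a = F.normalize(cap_a, dim=-1)
    cap_b = F.normalize(cap_b, dim=-1)
    sims = cap_a @ cap_b.t()                    # (batch, batch) cosine similarities
    pos = sims.diag().unsqueeze(1)              # similarity of each matched pair
    hinge = (margin + sims - pos).clamp(min=0)  # penalize close non-matching pairs
    mask = torch.eye(sims.size(0), dtype=torch.bool)
    return hinge.masked_fill(mask, 0.0).mean()  # exclude the positives themselves
```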
This paper attempts to learn explicit latent semantic annotations from paired structured tables and texts, establishing correspondences between various types of values and texts with an adapted semi-hidden Markov model.
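The core of semi-Markov decoding is jointly choosing a segmentation of the text and a table value for each segment, maximizing a sum of local scores. Below is a hedged sketch; `score(i, j, value)` is a hypothetical log-score for aligning tokens i..j-1 to `value`, standing in for the adapted model's emission and transition terms.

```python
# Segmental (semi-Markov) Viterbi: best segmentation + per-segment labels.
def semi_markov_viterbi(n_tokens, values, score, max_seg_len=5):
    best = [float("-inf")] * (n_tokens + 1)
    best[0] = 0.0
    back = [None] * (n_tokens + 1)
    for j in range(1, n_tokens + 1):                 # end position of a segment
        for i in range(max(0, j - max_seg_len), j):  # start position of a segment
            for v in values:
                s = best[i] + score(i, j, v)
                if s > best[j]:
                    best[j], back[j] = s, (i, v)
    segments, j = [], n_tokens                       # recover the best segmentation
    while j > 0:
        i, v = back[j]
        segments.append((i, j, v))
        j = i
    return list(reversed(segments))
```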