Generate referring expressions
This work focuses on incorporating better measures of visual context into referring expression models and finds that visual comparison to other objects within an image significantly improves performance.
Kosmos-2, a Multimodal Large Language Model (MLLM), is introduced, enabling new capabilities of perceiving object descriptions and grounding text to the visual world, and shedding light on the convergence of language, multimodal perception, action, and world modeling.
This paper presents a new approach (NeuralREG), relying on deep neural networks, which makes decisions about form and content in one go without explicit feature extraction, using a delexicalized version of the WebNLG corpus.
The enrichment of the WebNLG corpus is described, with the aim of further extending its usefulness as a resource for evaluating common NLG tasks, including Discourse Ordering, Lexicalization, and Referring Expression Generation.
A profile-based deep neural network model, ProfileREG, is proposed; it encodes both the local context and an external profile of the entity to generate reference realizations, producing each token by learning to choose between generating a pronoun, generating from a fixed vocabulary, or copying a word from the profile.
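The choice between generating a pronoun, generating from a vocabulary, or copying from the profile can be read as a learned gate that mixes three token distributions at each decoding step. Below is a minimal illustrative sketch of such a three-way mixture; the function names and inputs are hypothetical, not ProfileREG's actual API, and the gate logits would come from the decoder state in a real model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mix_distributions(gate_logits, p_pronoun, p_vocab, p_copy):
    """Mix three per-source token distributions with a learned gate.

    gate_logits: three scores (pronoun, vocab, copy), e.g. from the decoder.
    p_pronoun / p_vocab / p_copy: dicts mapping token -> probability,
    each summing to 1 within its own source.
    Returns one combined token -> probability distribution.
    """
    w_pro, w_voc, w_cop = softmax(gate_logits)
    tokens = set(p_pronoun) | set(p_vocab) | set(p_copy)
    return {t: w_pro * p_pronoun.get(t, 0.0)
             + w_voc * p_vocab.get(t, 0.0)
             + w_cop * p_copy.get(t, 0.0)
            for t in tokens}
```

Because the gate weights sum to 1 and each source distribution sums to 1, the mixture is itself a valid probability distribution, so the model can be trained end to end with ordinary cross-entropy over the mixed output.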
A trainable neural planning component is introduced that can generate effective plans several orders of magnitude faster than the original planner, together with a verification-by-reranking stage that substantially improves the faithfulness of the resulting texts.
Pento-DIARef is presented, a diagnostic dataset in a visual domain of puzzle pieces where referring expressions are generated by a well-known symbolic algorithm (the “Incremental Algorithm”), which itself is motivated by appeal to a hypothesised capability (eliminating distractors through application of Gricean maxims).
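The Incremental Algorithm referenced above selects attributes in a fixed preference order, keeping each one only if it rules out at least one remaining distractor, and stops once the target is uniquely identified. A simplified sketch (omitting details of the full Dale & Reiter formulation, such as always including the head noun's type attribute):

```python
def incremental_algorithm(target, distractors, preferred_attributes):
    """Simplified Incremental Algorithm for referring expression generation.

    target: dict mapping attribute name -> value for the intended referent.
    distractors: list of such dicts for the other objects in the scene.
    preferred_attributes: attribute names in fixed preference order.
    Returns the attribute-value pairs chosen for the description.
    """
    description = {}
    remaining = list(distractors)
    for attr in preferred_attributes:
        value = target.get(attr)
        if value is None:
            continue
        # Keep this attribute only if it rules out at least one distractor.
        if any(d.get(attr) != value for d in remaining):
            description[attr] = value
            remaining = [d for d in remaining if d.get(attr) == value]
        if not remaining:
            break  # target is uniquely identified
    return description  # may still be ambiguous if distractors remain
```

For example, with a red T-shaped target among a blue T and a red L, the algorithm selects both colour and shape; if colour alone eliminates every distractor, it stops after one attribute, mirroring the Gricean pressure against over-describing.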
An Interactive REG (IREG) model is proposed that can interact with a real REC model, using signals indicating whether the object has been located, together with the visual region returned by the REC model, to gradually refine the REs.
This work introduces a collaborative image ranking task, a grounded agreement game in which players are tasked with reaching agreement on how to rank a set of images given some sorting criterion, through largely unrestricted, role-symmetric dialogue.
This work presents Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks, and which is flexible enough to accept both textual and optional visual prompts (regions of interest) as input.