3260 papers • 126 benchmarks • 313 datasets
The task aims at labeling the pixels of an image or video that represent the object instance referred to by a linguistic expression. The referring expression (RE) must unambiguously identify an individual object (the referent) in a discourse or scene.
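Leaderboards for this task typically score a predicted mask against the ground-truth mask with intersection-over-union (reported as overall IoU or mean IoU, alongside precision at IoU thresholds). A minimal sketch over flat binary masks, assuming masks are given as equal-length sequences of 0/1 pixel labels:

```python
def mask_iou(pred, gt):
    """Intersection-over-union between two binary masks.

    pred, gt: equal-length sequences of 0/1 pixel labels
    (e.g. a flattened H*W mask). By convention, two empty
    masks are treated as a perfect match (IoU = 1.0).
    """
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0


# Example: 4-pixel masks agreeing on one foreground pixel.
score = mask_iou([1, 1, 0, 0], [1, 0, 1, 0])  # intersection 1, union 3
```

Mean IoU averages this score over test expressions, while overall IoU pools intersections and unions across the whole test set before dividing, so the two metrics can rank methods differently.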
These leaderboards are used to track progress in Referring Expression Segmentation.
Use these libraries to find Referring Expression Segmentation models and implementations.
This work proposes a system that can generate image segmentations from arbitrary text prompts at test time, building on a CLIP backbone that it extends with a transformer-based decoder for dense prediction.
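One common way to condition a dense decoder on a text prompt is FiLM-style modulation: the prompt embedding produces a per-channel scale and shift applied to the visual feature map. The sketch below is a toy NumPy illustration of that idea, not the system's actual implementation; the projection matrices `W_gamma` and `W_beta` are hypothetical stand-ins for learned parameters.

```python
import numpy as np


def film_condition(features, text_emb, W_gamma, W_beta):
    """FiLM-style conditioning of a visual feature map on a text embedding.

    features: (C, H, W) visual feature map
    text_emb: (D,) prompt embedding
    W_gamma, W_beta: (C, D) learned projections (random here)
    """
    gamma = W_gamma @ text_emb  # (C,) per-channel scale
    beta = W_beta @ text_emb    # (C,) per-channel shift
    return features * gamma[:, None, None] + beta[:, None, None]


rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 8))   # toy C=4 feature map
emb = rng.standard_normal(16)            # toy D=16 prompt embedding
Wg = rng.standard_normal((4, 16))
Wb = rng.standard_normal((4, 16))
out = film_condition(feats, emb, Wg, Wb)  # same (4, 8, 8) shape, prompt-modulated
```

Because the scale and shift depend only on the prompt, the same visual backbone can produce different segmentations for different expressions without recomputing image features.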
Referring object detection and referring image segmentation are important tasks that require joint understanding of visual information and natural language. Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process. To address these issues and complement similar efforts in visual question answering, we build CLEVR-Ref+, a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators. In addition to evaluating several state-of-the-art models on CLEVR-Ref+, we also propose IEP-Ref, a module network approach that significantly outperforms other models on our dataset. In particular, we present two interesting and important findings using IEP-Ref: (1) the module trained to transform feature maps into segmentation masks can be attached to any intermediate module to reveal the entire reasoning process step-by-step; (2) even if all training data has at least one object referred, IEP-Ref can correctly predict no-foreground when presented with false-premise referring expressions. To the best of our knowledge, this is the first direct and quantitative proof that neural modules behave in the way they are intended. We will release data and code for CLEVR-Ref+.
An end-to-end trainable recurrent and convolutional network model is proposed that jointly learns to process visual and linguistic information, produces high-quality segmentation output from the natural language expression, and outperforms baseline methods by a large margin.
This work argues that existing benchmarks for video object segmentation with referring expressions are mainly composed of trivial cases, in which referents can be identified with simple phrases, and supports this with a new categorization of the phrases in the DAVIS-2017 and Actor-Action datasets into trivial and non-trivial REs.
This work proposes a novel method, namely SynthRef, for generating synthetic referring expressions for target objects in an image (or video frame), and presents and disseminates the first large-scale dataset with synthetic referring expressions for video object segmentation.
Evaluation on standard benchmarks reveals that MTTR significantly outperforms previous art across multiple metrics, and is end-to-end trainable, free of text-related inductive bias components and requires no additional mask-refinement post-processing steps.
A region-based GRES baseline, ReLA, is proposed that adaptively divides the image into regions with sub-instance clues, explicitly models region-region and region-language dependencies, and achieves new state-of-the-art performance on both the newly proposed GRES and classic RES tasks.
This work proposes to decompose expressions into three modular components related to subject appearance, location, and relationship to other objects, which allows the framework to flexibly adapt to expressions containing different types of information in an end-to-end manner.
This work presents Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks and is flexible enough to accept both textual and optional visual prompts (regions of interest) as input.