3260 papers • 126 benchmarks • 313 datasets
The task aims at labeling the pixels of an image or video that represent the object instance referred to by a linguistic expression. The referring expression (RE) must unambiguously identify an individual object (the referent) in a discourse or scene.
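Leaderboards for this task typically score a predicted mask against the ground-truth mask with intersection-over-union (reported as overall IoU or mean IoU, alongside precision at IoU thresholds). A minimal sketch over flat binary masks, assuming masks are given as equal-length sequences of 0/1 pixel labels:

```python
def mask_iou(pred, gt):
    """Intersection-over-union between two binary masks.

    pred, gt: equal-length sequences of 0/1 pixel labels
    (e.g. a flattened H*W mask). By convention, two empty
    masks are treated as a perfect match (IoU = 1.0).
    """
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0


# Example: 4-pixel masks agreeing on one foreground pixel.
score = mask_iou([1, 1, 0, 0], [1, 0, 1, 0])  # intersection 1, union 3
```

Mean IoU averages this score over test expressions, while overall IoU pools intersections and unions across the whole test set before dividing, so the two metrics can rank methods differently.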
These leaderboards are used to track progress in Referring Expression Segmentation.
Use these libraries to find Referring Expression Segmentation models and implementations.
This work proposes a system that can generate image segmentations from arbitrary text prompts at test time, building on a CLIP backbone that it extends with a transformer-based decoder for dense prediction.
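One common way to condition a dense decoder on a text prompt is FiLM-style modulation: the prompt embedding produces a per-channel scale and shift applied to the visual feature map. The sketch below is a toy NumPy illustration of that idea, not the system's actual implementation; the projection matrices `W_gamma` and `W_beta` are hypothetical stand-ins for learned parameters.

```python
import numpy as np


def film_condition(features, text_emb, W_gamma, W_beta):
    """FiLM-style conditioning of a visual feature map on a text embedding.

    features: (C, H, W) visual feature map
    text_emb: (D,) prompt embedding
    W_gamma, W_beta: (C, D) learned projections (random here)
    """
    gamma = W_gamma @ text_emb  # (C,) per-channel scale
    beta = W_beta @ text_emb    # (C,) per-channel shift
    return features * gamma[:, None, None] + beta[:, None, None]


rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 8))   # toy C=4 feature map
emb = rng.standard_normal(16)            # toy D=16 prompt embedding
Wg = rng.standard_normal((4, 16))
Wb = rng.standard_normal((4, 16))
out = film_condition(feats, emb, Wg, Wb)  # same (4, 8, 8) shape, prompt-modulated
```

Because the scale and shift depend only on the prompt, the same visual backbone can produce different segmentations for different expressions without recomputing image features.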
Referring object detection and referring image segmentation are important tasks that require joint understanding of visual information and natural language. Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process. To address these issues and complement similar efforts in visual question answering, we build CLEVR-Ref+, a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators. In addition to evaluating several state-of-the-art models on CLEVR-Ref+, we also propose IEP-Ref, a module network approach that significantly outperforms other models on our dataset. In particular, we present two interesting and important findings using IEP-Ref: (1) the module trained to transform feature maps into segmentation masks can be attached to any intermediate module to reveal the entire reasoning process step-by-step; (2) even if all training data has at least one object referred, IEP-Ref can correctly predict no-foreground when presented with false-premise referring expressions. To the best of our knowledge, this is the first direct and quantitative proof that neural modules behave in the way they are intended. We will release data and code for CLEVR-Ref+.
An end-to-end trainable recurrent and convolutional network model is proposed that jointly learns to process visual and linguistic information, produces high-quality segmentation output from the natural language expression, and outperforms baseline methods by a large margin.
This work argues that existing benchmarks for video object segmentation with referring expressions are mainly composed of trivial cases, in which referents can be identified with simple phrases, and supports this with a new categorization of the phrases in the DAVIS-2017 and Actor-Action datasets into trivial and non-trivial REs.
This work proposes a novel method, namely SynthRef, for generating synthetic referring expressions for target objects in an image (or video frame), and presents and disseminates the first large-scale dataset with synthetic referring expressions for video object segmentation.
Evaluation on standard benchmarks reveals that MTTR significantly outperforms previous art across multiple metrics, and is end-to-end trainable, free of text-related inductive bias components and requires no additional mask-refinement post-processing steps.
A region-based GRES baseline, ReLA, is proposed that adaptively divides the image into regions with sub-instance clues, explicitly models region-region and region-language dependencies, and achieves new state-of-the-art performance on both the newly proposed GRES and classic RES tasks.
This work proposes to decompose expressions into three modular components related to subject appearance, location, and relationship to other objects, which allows the framework to flexibly adapt to expressions containing different types of information in an end-to-end manner.
This work presents Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks and is flexible enough to accept both textual and optional visual prompts (regions of interest) as input.