3260 papers • 126 benchmarks • 313 datasets
A segmentation task that does not use any human-level supervision for semantic segmentation, except for a backbone initialised with features pre-trained on image-level labels.
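Because no ground-truth labels are used during training, results on this task are commonly evaluated by clustering the learned features and matching the discovered clusters to ground-truth classes. The sketch below illustrates one such protocol, assuming pre-computed per-pixel features and ground-truth maps; the function name and the choice of k-means plus Hungarian matching are illustrative, not taken from any specific paper listed here.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def unsupervised_miou(features, labels, num_classes):
    """Cluster per-pixel features without labels, Hungarian-match the
    discovered clusters to ground-truth classes, and report mean IoU.
    features: (N, D) pixel embeddings; labels: (N,) class indices."""
    clusters = KMeans(n_clusters=num_classes, n_init=10).fit_predict(features)

    # Confusion matrix between discovered clusters and ground-truth classes.
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (clusters, labels), 1)

    # Hungarian matching: assign each cluster to the class maximising overlap.
    rows, cols = linear_sum_assignment(conf, maximize=True)
    remapped = np.zeros_like(clusters)
    for cluster_id, class_id in zip(rows, cols):
        remapped[clusters == cluster_id] = class_id

    # Per-class IoU with the remapped predictions.
    ious = []
    for c in range(num_classes):
        inter = np.sum((remapped == c) & (labels == c))
        union = np.sum((remapped == c) | (labels == c))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```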
These leaderboards are used to track progress in Unsupervised Semantic Segmentation.
Use these libraries to find Unsupervised Semantic Segmentation models and implementations.
No subtasks available.
A hierarchical Grouping Vision Transformer (GroupViT) goes beyond the regular grid-structure representation, learning to group image regions into progressively larger, arbitrary-shaped segments, and performs competitively with state-of-the-art transfer-learning methods that require greater levels of supervision.
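The grouping idea can be pictured as a small pooling module in which learnable group tokens softly absorb patch tokens, producing fewer tokens per stage. The PyTorch sketch below is a simplified illustration under that assumption; the actual GroupViT block additionally uses cross-attention layers and gumbel-softmax hard assignment.

```python
import torch
import torch.nn as nn

class GroupingBlock(nn.Module):
    """Minimal sketch of hierarchical grouping: learnable group tokens
    soft-assign patch tokens and pool them into fewer, larger segments.
    A hypothetical simplification, not the exact GroupViT block."""
    def __init__(self, dim, num_groups):
        super().__init__()
        self.group_tokens = nn.Parameter(torch.randn(num_groups, dim))
        self.proj_q = nn.Linear(dim, dim)
        self.proj_k = nn.Linear(dim, dim)

    def forward(self, patch_tokens):                     # (B, N, D)
        q = self.proj_q(self.group_tokens)               # (G, D)
        k = self.proj_k(patch_tokens)                    # (B, N, D)
        logits = torch.einsum('gd,bnd->bng', q, k)       # patch-to-group affinity
        assign = logits.softmax(dim=-1)                  # soft assignment (B, N, G)
        # Pool patches into group segments, normalised by assignment mass.
        grouped = torch.einsum('bng,bnd->bgd', assign, patch_tokens)
        grouped = grouped / assign.sum(dim=1).clamp(min=1e-6).unsqueeze(-1)
        return grouped, assign                           # (B, G, D), fewer tokens out
```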
This work leverages the retrieval abilities of a language-image pre-trained model, CLIP, to dynamically curate training sets from unlabelled images for arbitrary collections of concept names, and exploits the robust correspondences offered by modern image representations to co-segment entities among the resulting collections.
This work examines how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery, and proposes a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
This work dissects the preservation of patch-wise spatial information in CLIP and proposes a local-to-global framework for obtaining image tags, which significantly enhances multi-label classification performance on various benchmarks without dataset-specific training.
A novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), is introduced, which ensures unbiased image-text alignment of CLIP-based models using only image-text pairs, without necessitating additional supervision.
This work proposes an embarrassingly simple approach to better align image and text features, requiring no data formats other than image-text pairs, and indicates that attribute supervision enables vision-language models to accurately localize attribute-specified objects.
The findings suggest that MaskCLIP can serve as a new, reliable source of supervision for pixel-level dense prediction tasks, enabling annotation-free segmentation, specifically semantic segmentation.
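Annotation-free use of CLIP for segmentation can be sketched as per-patch classification against class-name text embeddings. The snippet below assumes dense patch embeddings have already been extracted from a CLIP image encoder (for example via a MaskCLIP-style modification of the final attention pooling) and that class embeddings come from the CLIP text encoder; the function itself is an illustrative simplification, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dense_zero_shot_segmentation(patch_features, text_features, h, w):
    """Assign every patch to its most similar class text embedding.
    patch_features: (H*W, D) dense image features; text_features: (C, D)
    class-name embeddings. Returns an (H, W) map of class indices."""
    patch_features = F.normalize(patch_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    similarity = patch_features @ text_features.t()      # (H*W, C) cosine scores
    return similarity.argmax(dim=-1).reshape(h, w)
```

In practice the coarse patch-level map is upsampled (for example bilinearly over the similarity logits) to the input resolution before evaluation.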
This paper proposes a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to directly learn region-text alignment, achieving state-of-the-art zero-shot segmentation performance by large margins on all datasets.
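Region-text alignment objectives of this kind typically reduce to a contrastive loss between pooled region features and caption embeddings. The sketch below is a hypothetical, minimal version of such an objective using a standard symmetric InfoNCE over a batch; it is not the exact TCL loss.

```python
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(region_emb, text_emb, temperature=0.07):
    """Minimal sketch of a text-grounded contrastive objective: the i-th
    image's text-grounded region embedding should match the i-th caption
    embedding and repel all other captions in the batch.
    region_emb, text_emb: (B, D) pooled features."""
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = region_emb @ text_emb.t() / temperature     # (B, B) similarity
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE, as in standard image-text contrastive training.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```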