3260 papers • 126 benchmarks • 313 datasets
A segmentation task that does not use any human-level supervision for semantic segmentation, except for a backbone initialised with features pre-trained on image-level labels.
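Because no ground-truth labels are used during training, results on this task are commonly evaluated by clustering the learned features and matching the discovered clusters to ground-truth classes. The sketch below illustrates one such protocol, assuming pre-computed per-pixel features and ground-truth maps; the function name and the choice of k-means plus Hungarian matching are illustrative, not taken from any specific paper listed here.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def unsupervised_miou(features, labels, num_classes):
    """Cluster per-pixel features without labels, Hungarian-match the
    discovered clusters to ground-truth classes, and report mean IoU.
    features: (N, D) pixel embeddings; labels: (N,) class indices."""
    clusters = KMeans(n_clusters=num_classes, n_init=10).fit_predict(features)

    # Confusion matrix between discovered clusters and ground-truth classes.
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (clusters, labels), 1)

    # Hungarian matching: assign each cluster to the class maximising overlap.
    rows, cols = linear_sum_assignment(conf, maximize=True)
    remapped = np.zeros_like(clusters)
    for cluster_id, class_id in zip(rows, cols):
        remapped[clusters == cluster_id] = class_id

    # Per-class IoU with the remapped predictions.
    ious = []
    for c in range(num_classes):
        inter = np.sum((remapped == c) & (labels == c))
        union = np.sum((remapped == c) | (labels == c))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```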
These leaderboards are used to track progress in Unsupervised Semantic Segmentation.
Use these libraries to find Unsupervised Semantic Segmentation models and implementations.
No subtasks available.
A hierarchical Grouping Vision Transformer (GroupViT) goes beyond the regular grid-structure representation, learning to group image regions into progressively larger, arbitrary-shaped segments, and performs competitively with state-of-the-art transfer-learning methods that require greater levels of supervision.
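The grouping idea can be pictured as a small pooling module in which learnable group tokens softly absorb patch tokens, producing fewer tokens per stage. The PyTorch sketch below is a simplified illustration under that assumption; the actual GroupViT block additionally uses cross-attention layers and gumbel-softmax hard assignment.

```python
import torch
import torch.nn as nn

class GroupingBlock(nn.Module):
    """Minimal sketch of hierarchical grouping: learnable group tokens
    soft-assign patch tokens and pool them into fewer, larger segments.
    A hypothetical simplification, not the exact GroupViT block."""
    def __init__(self, dim, num_groups):
        super().__init__()
        self.group_tokens = nn.Parameter(torch.randn(num_groups, dim))
        self.proj_q = nn.Linear(dim, dim)
        self.proj_k = nn.Linear(dim, dim)

    def forward(self, patch_tokens):                     # (B, N, D)
        q = self.proj_q(self.group_tokens)               # (G, D)
        k = self.proj_k(patch_tokens)                    # (B, N, D)
        logits = torch.einsum('gd,bnd->bng', q, k)       # patch-to-group affinity
        assign = logits.softmax(dim=-1)                  # soft assignment (B, N, G)
        # Pool patches into group segments, normalised by assignment mass.
        grouped = torch.einsum('bng,bnd->bgd', assign, patch_tokens)
        grouped = grouped / assign.sum(dim=1).clamp(min=1e-6).unsqueeze(-1)
        return grouped, assign                           # (B, G, D), fewer tokens out
```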
This work leverages the retrieval abilities of a language-image pre-trained model, CLIP, to dynamically curate training sets from unlabelled images for arbitrary collections of concept names, and exploits the robust correspondences offered by modern image representations to co-segment entities among the resulting collections.
This work examines how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery, and proposes a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
This work dissects the preservation of patch-wise spatial information in CLIP and proposes a local-to-global framework for obtaining image tags, which significantly enhances multi-label classification performance on various benchmarks without dataset-specific training.
A novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), is introduced, which ensures unbiased image-text alignment of CLIP-based models using only image-text pairs, without necessitating additional supervision.
This work proposes an embarrassingly simple approach to better align image and text features, requiring no data formats other than image-text pairs, and indicates that attribute supervision enables vision-language models to accurately localize attribute-specified objects.
The findings suggest that MaskCLIP can serve as a new, reliable source of supervision for pixel-level dense prediction tasks, enabling annotation-free segmentation, specifically semantic segmentation.
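Annotation-free use of CLIP for segmentation can be sketched as per-patch classification against class-name text embeddings. The snippet below assumes dense patch embeddings have already been extracted from a CLIP image encoder (for example via a MaskCLIP-style modification of the final attention pooling) and that class embeddings come from the CLIP text encoder; the function itself is an illustrative simplification, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dense_zero_shot_segmentation(patch_features, text_features, h, w):
    """Assign every patch to its most similar class text embedding.
    patch_features: (H*W, D) dense image features; text_features: (C, D)
    class-name embeddings. Returns an (H, W) map of class indices."""
    patch_features = F.normalize(patch_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    similarity = patch_features @ text_features.t()      # (H*W, C) cosine scores
    return similarity.argmax(dim=-1).reshape(h, w)
```

In practice the coarse patch-level map is upsampled (for example bilinearly over the similarity logits) to the input resolution before evaluation.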
This paper proposes a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to directly learn region-text alignment, achieving state-of-the-art zero-shot segmentation performance by large margins on all datasets.
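Region-text alignment objectives of this kind typically reduce to a contrastive loss between pooled region features and caption embeddings. The sketch below is a hypothetical, minimal version of such an objective using a standard symmetric InfoNCE over a batch; it is not the exact TCL loss.

```python
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(region_emb, text_emb, temperature=0.07):
    """Minimal sketch of a text-grounded contrastive objective: the i-th
    image's text-grounded region embedding should match the i-th caption
    embedding and repel all other captions in the batch.
    region_emb, text_emb: (B, D) pooled features."""
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = region_emb @ text_emb.t() / temperature     # (B, B) similarity
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE, as in standard image-text contrastive training.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```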