These leaderboards are used to track progress in Open-Vocabulary Panoptic Segmentation.
This letter proposes the first algorithm for open-vocabulary panoptic segmentation in 3D scenes, which achieves panoptic segmentation performance similar to state-of-the-art closed-set 3D systems on the HyperSim, ScanNet, and Replica datasets and additionally outperforms current 3D open-vocabulary systems in terms of semantic segmentation.
The finding suggests that MaskCLIP can serve as a new, reliable source of supervision for pixel-level dense prediction tasks, enabling annotation-free semantic segmentation.
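As a minimal sketch of how such supervision can be used, the snippet below matches dense CLIP-space pixel features against text embeddings of class prompts to produce per-pixel pseudo-labels. The feature extractor is left out, and the shapes and function name are illustrative assumptions, not the paper's code.

import torch
import torch.nn.functional as F

def pseudo_label_map(dense_feats: torch.Tensor,   # (C, H, W) CLIP-space pixel features
                     text_embeds: torch.Tensor    # (K, C) one embedding per class prompt
                     ) -> torch.Tensor:           # (H, W) hard pseudo-labels
    # Normalize both sides so the dot product is a cosine similarity.
    feats = F.normalize(dense_feats.flatten(1).T, dim=-1)      # (H*W, C)
    texts = F.normalize(text_embeds, dim=-1)                   # (K, C)
    logits = feats @ texts.T                                   # (H*W, K)
    return logits.argmax(-1).view(dense_feats.shape[1:])       # (H, W)

# Usage with dummy tensors (512-d CLIP space, 20 classes, 32x32 feature map):
labels = pseudo_label_map(torch.randn(512, 32, 32), torch.randn(20, 512))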
The developed MaskCLIP is an encoder-only module that seamlessly integrates mask tokens with a pre-trained ViT CLIP model for semantic/instance segmentation and class prediction, avoiding the time-consuming student-teacher training process.
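A minimal sketch of the encoder-only idea follows: learnable mask tokens are concatenated with the patch tokens of a (frozen) CLIP ViT and attended jointly, so each mask token can directly read out a per-mask embedding for classification. All module names, depths, and sizes here are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn

class MaskTokenHead(nn.Module):
    def __init__(self, dim: int = 768, num_masks: int = 100, depth: int = 2):
        super().__init__()
        self.mask_tokens = nn.Parameter(torch.randn(num_masks, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) from a pre-trained CLIP ViT, kept frozen.
        b = patch_tokens.size(0)
        tokens = self.mask_tokens.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([tokens, patch_tokens], dim=1)   # (B, num_masks + N, dim)
        x = self.encoder(x)
        return x[:, : self.mask_tokens.size(0)]        # per-mask embeddings

head = MaskTokenHead()
mask_embeds = head(torch.randn(2, 196, 768))           # (2, 100, 768)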
ODISE is presented, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation and outperforms the previous state of the art by significant margins on both open-vocabulary panoptic and semantic segmentation tasks.
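The two-model split this summary describes can be sketched as follows: frozen diffusion-model features drive mask proposals, while a discriminative text encoder supplies open-vocabulary class weights. The mask head, the assumption that the diffusion features are already projected to the text embedding dimension, and all shapes are illustrative, not ODISE's actual pipeline.

import torch
import torch.nn.functional as F

def classify_proposals(diffusion_feats, mask_head, text_embeds):
    # diffusion_feats: (B, C, H, W) internal features of a frozen text-to-image
    #                  diffusion model, assumed projected to the text embedding dim C
    # mask_head:       any module mapping features -> (B, Q, H, W) mask logits
    # text_embeds:     (K, C) embeddings of category names from a text encoder
    mask_logits = mask_head(diffusion_feats)                    # (B, Q, H, W)
    attn = mask_logits.flatten(2).softmax(-1)                   # (B, Q, H*W)
    pooled = attn @ diffusion_feats.flatten(2).transpose(1, 2)  # (B, Q, C)
    pooled = F.normalize(pooled, dim=-1)
    return pooled @ F.normalize(text_embeds, dim=-1).T          # (B, Q, K) class scores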
This work proposes to build everything into a single-stage framework with a shared frozen convolutional CLIP backbone, which not only significantly simplifies the existing two-stage pipeline but also yields a remarkably better accuracy-cost trade-off.
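A minimal sketch of that shared-backbone design: one frozen convolutional CLIP backbone is run once per image, and both the mask generator and the open-vocabulary classifier read from the same feature map. The module names and the decoder interface are assumptions, not the paper's code.

import torch
import torch.nn as nn

class SingleStageOVSeg(nn.Module):
    def __init__(self, frozen_clip_backbone: nn.Module, mask_decoder: nn.Module):
        super().__init__()
        self.backbone = frozen_clip_backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)       # the CLIP backbone stays frozen
        self.mask_decoder = mask_decoder  # trainable, e.g. a Mask2Former-style head

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        feats = self.backbone(images)                 # one shared forward pass
        masks, queries = self.mask_decoder(feats)     # (B, Q, H, W), (B, Q, C)
        logits = queries @ text_embeds.T              # classify via text embeddings
        return masks, logits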
An in-depth analysis of region-language alignment in CLIP models motivates CLIPSelf, an approach that adapts the image-level recognition ability of a CLIP ViT to local image regions without needing any region-text pairs.
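A toy rendering of that self-distillation idea, assuming average pooling and a single region: CLIP's image-level embedding of a crop serves as the teacher, and the region-pooled dense features of the full image serve as the student, with no region-text pairs involved. Function name, pooling choice, and shapes are illustrative.

import torch
import torch.nn.functional as F

def clipself_loss(dense_feats, crop_embed, box):
    # dense_feats: (C, H, W) student dense features of the full image
    # crop_embed:  (C,) teacher CLIP embedding of the cropped region
    # box:         (x0, y0, x1, y1) non-empty region in feature-map coordinates
    x0, y0, x1, y1 = box
    region = dense_feats[:, y0:y1, x0:x1].mean(dim=(1, 2))  # average-pool the region
    region = F.normalize(region, dim=0)
    target = F.normalize(crop_embed, dim=0)
    return 1.0 - (region * target).sum()                    # cosine distillation loss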
This work presents an open-vocabulary panoptic segmentation model that effectively unifies the strengths of the Segment Anything Model (SAM) and the vision-language CLIP model in an end-to-end framework, and proposes a novel Local Discriminative Pooling (LDP) module that leverages class-agnostic SAM features and class-aware CLIP features for unbiased open-vocabulary classification.
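The mask-pooling step such a SAM+CLIP model builds on can be sketched as below: class-agnostic binary masks (SAM-style) select regions of a dense CLIP feature map, and the pooled embeddings are scored against text embeddings. The actual LDP module is more involved; this only shows the underlying mask-pooled classification, with assumed shapes.

import torch
import torch.nn.functional as F

def mask_pooled_scores(clip_feats, masks, text_embeds):
    # clip_feats:  (C, H, W) dense CLIP image features
    # masks:       (M, H, W) binary class-agnostic masks
    # text_embeds: (K, C) CLIP text embeddings of class names
    m = masks.flatten(1).float()                                  # (M, H*W)
    feats = clip_feats.flatten(1).T                               # (H*W, C)
    pooled = (m @ feats) / m.sum(-1, keepdim=True).clamp(min=1)   # (M, C)
    pooled = F.normalize(pooled, dim=-1)
    return pooled @ F.normalize(text_embeds, dim=-1).T            # (M, K) class scores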