Video Panoptic Segmentation is a computer vision task that extends panoptic segmentation by incorporating the temporal dimension: given a video sequence, the goal is to predict the semantic class of each pixel while consistently tracking object instances, so that pixels belonging to the same object instance are assigned the same instance ID throughout the video sequence.
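As an illustration, a video panoptic prediction can be stored as one panoptic ID map per frame, where each pixel encodes a semantic class and an instance ID that stays fixed for the same object across frames. The encoding below (class * 1000 + instance) is a common convention assumed here, not something mandated by the task:

```python
import numpy as np

# Assumed encoding: panoptic_id = semantic_class * OFFSET + instance_id,
# where "stuff" pixels use instance_id 0 and "thing" instances keep the
# same instance_id in every frame of the video.
OFFSET = 1000

def encode(semantic_class: int, instance_id: int = 0) -> int:
    return semantic_class * OFFSET + instance_id

def decode(panoptic_map: np.ndarray):
    return panoptic_map // OFFSET, panoptic_map % OFFSET

# Toy 2-frame "video" of 2x3 panoptic ID maps: the object of class 11 keeps
# instance id 1 in both frames, which is the consistency the task requires.
frame_0 = np.array([[encode(7), encode(7),      encode(11, 1)],
                    [encode(7), encode(11, 1),  encode(11, 1)]])
frame_1 = np.array([[encode(7), encode(11, 1),  encode(11, 1)],
                    [encode(7), encode(7),      encode(11, 1)]])

classes, instances = decode(frame_1)
print(classes)    # per-pixel semantic classes
print(instances)  # per-pixel instance ids, consistent with frame_0
```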
These leaderboards are used to track progress in Video Panoptic Segmentation
Use these libraries to find Video Panoptic Segmentation models and implementations
No subtasks available.
Tube-Link is a near-online approach that takes a short subclip as input and outputs the corresponding spatial-temporal tube masks, and it applies temporal contrastive learning to instance-wise discriminative features for tube-level association.
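A minimal sketch of the tube-level association idea described above, assuming each subclip produces one L2-normalized embedding per tube; the matching threshold and embedding setup are illustrative simplifications, not Tube-Link's actual implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_tubes(prev_emb: np.ndarray, curr_emb: np.ndarray, thresh: float = 0.5):
    """Match tubes of adjacent subclips by cosine similarity of their embeddings.

    prev_emb: (N, D) L2-normalized embeddings of tubes in the previous subclip.
    curr_emb: (M, D) L2-normalized embeddings of tubes in the current subclip.
    Returns (prev_idx, curr_idx) pairs that should share one instance ID.
    """
    sim = prev_emb @ curr_emb.T                 # (N, M) cosine similarities
    rows, cols = linear_sum_assignment(-sim)    # maximize total similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] > thresh]

# Example: two tubes in each subclip, matched by nearest embedding.
prev = np.array([[1.0, 0.0], [0.0, 1.0]])
curr = np.array([[0.1, 0.995], [0.995, 0.1]])
print(link_tubes(prev, curr))   # [(0, 1), (1, 0)]
```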
In this paper, we present ViP-DeepLab, a unified model attempting to tackle the long-standing and challenging inverse projection problem in vision, which we model as restoring the point clouds from perspective image sequences while providing each point with instance-level semantic interpretations. Solving this problem requires the vision models to predict the spatial location, semantic class, and temporally consistent instance label for each 3D point. ViP-DeepLab approaches it by jointly performing monocular depth estimation and video panoptic segmentation. We name this joint task Depth-aware Video Panoptic Segmentation, and propose a new evaluation metric along with two derived datasets for it, which will be made available to the public. On the individual sub-tasks, ViP-DeepLab also achieves state-of-the-art results, outperforming previous methods by 5.1% VPQ on Cityscapes-VPS, ranking 1st on the KITTI monocular depth estimation benchmark, and 1st on KITTI MOTS pedestrian. The datasets and the evaluation codes are made publicly available.
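The inverse projection the abstract refers to can be illustrated with a standard pinhole back-projection: once per-pixel depth and panoptic IDs have been predicted, every pixel lifts to a 3D point that carries its semantic class and instance label. The intrinsics are placeholders and this is a simplified view, not ViP-DeepLab's actual output pipeline:

```python
import numpy as np

def lift_to_point_cloud(depth: np.ndarray, panoptic: np.ndarray,
                        fx: float, fy: float, cx: float, cy: float):
    """Back-project a depth map into a 3D point cloud labeled with panoptic IDs.

    depth:    (H, W) metric depth per pixel.
    panoptic: (H, W) panoptic ID per pixel (semantic class plus a temporally
              consistent instance label).
    Returns (H*W, 3) camera-frame points and their (H*W,) panoptic labels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points, panoptic.reshape(-1)
```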
This work introduces a new benchmark encompassing two datasets, KITTI-STEP and MOTChallenge-STEP, and proposes a novel evaluation metric, Segmentation and Tracking Quality (STQ), that fairly balances semantic and tracking aspects of this task and is more appropriate for evaluating sequences of arbitrary length.
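For reference, STQ balances the two aspects by combining an association term and a segmentation term as a geometric mean. The sketch below shows only that final combination and assumes Association Quality (AQ) and Segmentation Quality (SQ) have already been computed as defined in the benchmark paper:

```python
import math

def stq(aq: float, sq: float) -> float:
    """Segmentation and Tracking Quality: geometric mean of AQ and SQ,
    so neither the tracking side nor the semantic side can dominate."""
    return math.sqrt(aq * sq)

print(stq(0.60, 0.75))  # ~0.67 for a hypothetical result
```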
PolyphonicFormer, a vision transformer that unifies the depth estimation and video panoptic segmentation sub-tasks of DVPS and leads to more robust results, achieves state-of-the-art results on two DVPS datasets and ranks 1st on the ICCV-2021 BMTT Challenge video + depth track.
In this paper, we present a new large-scale dataset for the video panoptic segmentation task, which aims to assign semantic classes and track identities to all pixels in a video. As the ground truth for this task is difficult to annotate, previous datasets for video panoptic segmentation are limited by either small scales or the number of scenes. In contrast, our large-scale VIdeo Panoptic Segmentation in the Wild (VIPSeg) dataset provides 3,536 videos and 84,750 frames with pixel-level panoptic annotations, covering a wide range of real-world scenarios and categories. To the best of our knowledge, our VIPSeg is the first attempt to tackle the challenging video panoptic segmentation task in the wild by considering diverse scenarios. Based on VIPSeg, we evaluate existing video panoptic segmentation approaches and propose an efficient and effective clip-based baseline method to analyze our VIPSeg dataset. Our dataset is available at https://github.com/VIPSeg-Dataset/VIPSeg-Dataset/.
Video K-Net is presented, a simple, strong, and unified framework for fully end-to-end video panoptic segmentation that achieves state-of-the-art video panoptic segmentation results on Cityscapes-VPS and KITTI-STEP without bells and whistles and can serve as a new flexible baseline in video segmentation.
The Waymo Open Dataset is presented, a large-scale dataset that offers high-quality panoptic segmentation labels for autonomous driving and a new benchmark for Panoramic Video Panoptic Segmentation based on the DeepLab family of models is proposed.
PVO models visual odometry (VO) and video panoptic segmentation (VPS) in a unified view, making the two tasks mutually beneficial: they contribute to each other through recurrent iterative optimization.
Object queries have emerged as a powerful abstraction to generically represent object proposals. However, their use for temporal tasks like video segmentation poses two questions: 1) How to process frames sequentially and propagate object queries seamlessly across frames. Using independent object queries per frame doesn't permit tracking, and requires post-processing. 2) How to produce temporally consistent, yet expressive object queries that model both appearance and position changes. Using the entire video at once doesn't capture position changes and doesn't scale to long videos. As one answer to both questions, we propose 'context-aware relative object queries', which are continuously propagated frame-by-frame. They seamlessly track objects and deal with occlusion and re-appearance of objects, without post-processing. Further, we find context-aware relative object queries better capture position changes of objects in motion. We evaluate the proposed approach across three challenging tasks: video instance segmentation, multi-object tracking and segmentation, and video panoptic segmentation. Using the same approach and architecture, we match or surpass state-of-the-art results on the diverse and challenging OVIS, YouTube-VIS, Cityscapes-VPS, MOTS 2020 and KITTI-MOTS data.
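A rough sketch of the frame-by-frame query propagation described above, using a generic transformer decoder; the module sizes, shapes, and decoder choice are illustrative assumptions rather than the paper's architecture:

```python
import torch
import torch.nn as nn

d_model, num_queries = 256, 16
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8), num_layers=3)

# Queries refined on frame t-1 become the input queries for frame t, so the
# k-th query keeps referring to the same object and no post-processing
# (e.g. per-frame matching) is needed for tracking.
queries = torch.randn(num_queries, 1, d_model)                      # (Q, B, C)
video_features = [torch.randn(100, 1, d_model) for _ in range(5)]   # per-frame tokens

per_frame_queries = []
for feats in video_features:
    queries = decoder(tgt=queries, memory=feats)   # attend to the current frame
    per_frame_queries.append(queries)              # same index = same object
```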