The goal of video instance segmentation (VIS) is the simultaneous detection, segmentation and tracking of object instances in videos. In other words, it extends the image instance segmentation problem to the video domain for the first time. To facilitate research on this task, a large-scale benchmark called YouTube-VIS was built, consisting of 2,883 high-resolution YouTube videos, a 40-category label set and 131k high-quality instance masks.
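Because a predicted instance is a whole track (a mask in every frame), the YouTube-VIS evaluation computes IoU over the full video rather than per frame. A minimal sketch of that spatio-temporal IoU, assuming tracks are represented as frame-aligned lists of binary masks (this representation is illustrative, not the benchmark's actual file format):

```python
def video_iou(pred_masks, gt_masks):
    """Spatio-temporal IoU between two instance tracks.

    Each track is a list of per-frame binary masks (2-D lists of 0/1),
    aligned frame by frame; an all-zero mask means the instance is
    absent in that frame. Intersection and union are accumulated over
    every frame before dividing, so a track that loses the object in
    some frames is penalized accordingly.
    """
    inter = union = 0
    for pm, gm in zip(pred_masks, gt_masks):
        for prow, grow in zip(pm, gm):
            for p, g in zip(prow, grow):
                inter += p and g
                union += p or g
    return inter / union if union else 0.0
```

With this video-level IoU in place of image IoU, average precision can be computed exactly as in image instance segmentation.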
These leaderboards are used to track progress in Video Instance Segmentation
Use these libraries to find Video Instance Segmentation models and implementations
This paper integrates appearance information to improve the performance of SORT and reduces the number of identity switches, achieving overall competitive performance at high frame rates.
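The core idea of adding appearance information to SORT-style tracking is to score each track-detection pair by a weighted mix of spatial overlap and feature similarity, then associate greedily. A minimal sketch of that association step; the weight `w` and threshold `thresh` are illustrative hyper-parameters, not values from the paper, and the greedy loop stands in for the Hungarian matching a full tracker would use:

```python
import math

def cosine(a, b):
    """Cosine similarity between two appearance feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = ((a[2] - a[0]) * (a[3] - a[1])
            + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / area if area else 0.0

def associate(tracks, detections, w=0.5, thresh=0.3):
    """Greedily match tracks to detections.

    tracks / detections: lists of (box, feature) pairs.
    Score = w * IoU + (1 - w) * appearance similarity; pairs below
    `thresh` are left unmatched. Returns (track_idx, det_idx) pairs.
    """
    scores = sorted(((w * box_iou(tb, db) + (1 - w) * cosine(tf, df), ti, di)
                     for ti, (tb, tf) in enumerate(tracks)
                     for di, (db, df) in enumerate(detections)),
                    reverse=True)
    used_t, used_d, pairs = set(), set(), []
    for s, ti, di in scores:
        if s < thresh:
            break  # scores are sorted, so all remaining pairs fail too
        if ti in used_t or di in used_d:
            continue
        used_t.add(ti)
        used_d.add(di)
        pairs.append((ti, di))
    return pairs
```

The appearance term is what reduces identity switches: when two objects cross and their boxes overlap, the feature similarity still distinguishes them.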
The image instance segmentation problem is extended to the video domain for the first time, and a novel algorithm called MaskTrack R-CNN is proposed for this task of simultaneous detection, segmentation and tracking of instances in videos.
For the first time, it is demonstrated that a simple end-to-end query based framework can achieve the state-of-the-art performance in various instance-level recognition tasks.
It is found Mask2Former achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline, and is also capable of handling video semantic and panoptic segmentation.
The proposed Temporally Efficient Vision Transformer (TeViT) is nearly convolution-free: it contains a transformer backbone and a query-based video instance segmentation head, fully utilizes both frame-level and instance-level temporal context information, and obtains strong temporal modeling capacity with negligible extra computational cost.
A new video instance segmentation framework built upon Transformers, termed VisTR, views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem; it achieves the highest speed among all existing VIS models and the best result among single-model methods on the YouTube-VIS dataset.
A simple plug-and-play module that performs temporal feature calibration to complement missing object cues caused by occlusion is presented, and a remarkable AP improvement on the OVIS dataset is obtained.
It is shown that learning additional invariances -- through the use of multi-scale cropping, stronger augmentations and nearest neighbors -- improves the representations and it is observed that MoCo learns spatially structured representations when trained with a multi-crop strategy.
This report introduces a two-step "detect-then-match" video instance segmentation method that achieves the first place in the UVO 2021 Video-based Open-World Segmentation Challenge.
D2 Conv3D is proposed: a novel type of convolution which draws inspiration from dilated and deformable convolutions, extends them to the 3D (spatio-temporal) domain, and can be used to improve the performance of multiple 3D CNN architectures across several video segmentation benchmarks.