3260 papers • 126 benchmarks • 313 datasets
These leaderboards are used to track progress in video semantic segmentation.
Use these libraries to find video semantic segmentation models and implementations.
This paper exploits global context information via different-region-based context aggregation through a pyramid pooling module, together with the proposed pyramid scene parsing network (PSPNet), to produce good-quality results on the scene parsing task.
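The pyramid pooling idea can be sketched in a few lines: average-pool the same feature map at several grid resolutions and concatenate the pooled values as multi-scale context. This is a minimal single-channel illustration with made-up bin sizes matching the paper's description, not PSPNet's actual implementation.

```python
def pyramid_pool(feat, bin_sizes=(1, 2, 3, 6)):
    """Average-pool a square 2D feature map into several grid sizes
    and concatenate the pooled values (illustrative sketch only)."""
    n = len(feat)  # assume a square n x n map
    pooled = []
    for bins in bin_sizes:
        step = n / bins
        for by in range(bins):
            for bx in range(bins):
                ys = range(int(by * step), int((by + 1) * step))
                xs = range(int(bx * step), int((bx + 1) * step))
                cells = [feat[y][x] for y in ys for x in xs]
                pooled.append(sum(cells) / len(cells))
    return pooled

# A 6x6 toy feature map: 1 + 4 + 9 + 36 = 50 pooled context values.
feat = [[float(y * 6 + x) for x in range(6)] for y in range(6)]
ctx = pyramid_pool(feat)
print(len(ctx))  # 50
```

In the real network these pooled maps are upsampled back to the input resolution and concatenated with the original features before the final prediction layer.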
The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
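The "arbitrary size in, correspondingly-sized output out" property comes from using only convolutions, whose output dimensions are a function of the input dimensions. A toy 1-D valid convolution (not the paper's architecture) makes the size relationship concrete:

```python
def conv1d(x, kernel):
    """'Valid' 1-D convolution: output length = len(x) - len(kernel) + 1."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

# The same kernel applies to inputs of any length, and the output
# size tracks the input size -- the property fully convolutional
# networks exploit for dense per-pixel prediction.
edge = [1.0, -1.0]
print(len(conv1d([0.0] * 10, edge)))  # 9
print(len(conv1d([0.0] * 25, edge)))  # 24
```

A network built only from such layers (no fixed-size fully connected layer) therefore accepts images of any resolution.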
This work addresses semi-supervised video object segmentation (the task of automatically generating accurate and consistent pixel masks for objects in a video sequence, given first-frame ground-truth annotations) with the PReMVOS algorithm.
The MiVOS framework is presented, which decouples interaction-to-mask and mask propagation, allowing for higher generalizability and better performance, and a large-scale synthetic VOS dataset with pixel-accurate segmentation of 4.8M frames is contributed to facilitate future research.
It is found that Mask2Former achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss, or even the training pipeline, and is also capable of handling video semantic and panoptic segmentation.
The hypothesis is that a representation good for recognition requires the convolutional features to find correspondences between similar objects or parts; VFS surpasses state-of-the-art self-supervised approaches on both OTB visual object tracking and DAVIS video object segmentation.
The results indicate that using a larger training set is not automatically better, and that for the video object segmentation task a smaller training set that is closer to the target domain is more effective.
This work builds a new large-scale video object segmentation dataset called YouTube Video Object Segmentation dataset (YouTube-VOS) and proposes a novel sequence-to-sequence network to fully exploit long-term spatial-temporal information in videos for segmentation.
For each pixel, a novel criss-cross attention module in CCNet harvests the contextual information of all the pixels on its criss-cross path; by taking a further recurrent operation, each pixel can finally capture full-image dependencies from all pixels.
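The criss-cross idea can be sketched with scalar features: each pixel attends only to its own row and column, and applying the module a second time (the recurrent operation) lets information reach every pixel from every other. This is an illustrative sketch of the aggregation pattern, not CCNet's actual query/key/value formulation.

```python
import math

def criss_cross(feat):
    """For each pixel, aggregate values along its row and column,
    weighted by a softmax over feature similarity (scalar sketch)."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Criss-cross path: the pixel's full row plus its column.
            path = [(y, j) for j in range(w)] + \
                   [(i, x) for i in range(h) if i != y]
            scores = [feat[y][x] * feat[i][j] for i, j in path]
            m = max(scores)  # subtract max for numerical stability
            weights = [math.exp(s - m) for s in scores]
            z = sum(weights)
            out[y][x] = sum(wgt * feat[i][j]
                            for wgt, (i, j) in zip(weights, path)) / z
    return out

feat = [[1.0, 2.0], [3.0, 4.0]]
ctx1 = criss_cross(feat)   # one pass: row + column context only
ctx2 = criss_cross(ctx1)   # second (recurrent) pass: full-image reach
```

Restricting attention to the criss-cross path cuts the cost from O((HW)^2) pairs to O(HW(H+W)); the second pass recovers full-image dependencies, since any two pixels are linked through a shared row or column.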
An interactive video object segmentation algorithm, which takes scribble annotations on query objects as input, is proposed in this paper and outperforms the state-of-the-art conventional algorithms.