3260 papers • 126 benchmarks • 313 datasets
Video object detection is the task of detecting objects in a video rather than in still images. (Image credit: Learning Motion Priors for Efficient Video Object Detection)
These leaderboards are used to track progress in video object detection.
Use these libraries to find video object detection models and implementations.
No subtasks available.
This paper asks whether self-supervised learning gives Vision Transformers (ViTs) properties that stand out compared to convolutional networks (convnets), and introduces DINO, a form of self-distillation with no labels that highlights the synergy between this training method and ViTs.
A generic and effective Temporal Shift Module (TSM) that can achieve the performance of 3D CNN but maintain 2D CNN’s complexity and is extended to online setting, which enables real-time low-latency online video recognition and video object detection.
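The core idea of the Temporal Shift Module can be illustrated in a few lines: a fraction of the channels is shifted forward and backward along the time axis, so a plain 2D CNN sees information from neighboring frames at zero extra FLOPs. Below is a minimal NumPy sketch of that shifting step (the function name and `shift_div` parameter are illustrative, not the paper's actual API):

```python
import numpy as np

def temporal_shift(features, shift_div=8):
    """Shift a fraction of channels along the temporal dimension.

    features: array of shape (T, C, H, W) -- per-frame feature maps.
    1/shift_div of the channels is shifted toward the past, another
    1/shift_div toward the future; vacated positions are zero-padded.
    """
    T, C, H, W = features.shape
    fold = C // shift_div
    out = np.zeros_like(features)
    # First fold: bring in the next frame's channels (shift left).
    out[:-1, :fold] = features[1:, :fold]
    # Second fold: bring in the previous frame's channels (shift right).
    out[1:, fold:2 * fold] = features[:-1, fold:2 * fold]
    # Remaining channels are left untouched.
    out[:, 2 * fold:] = features[:, 2 * fold:]
    return out
```

In the full model, a shift like this is inserted inside residual blocks, so the subsequent 2D convolutions mix temporal information without any 3D kernels.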
This approach combines fast single-image object detection with convolutional long short term memory layers to create an inter-weaved recurrent-convolutional architecture that is substantially faster than existing detection methods in video and significantly reduces computational cost.
The proposed method is based on the Generative Adversarial Network (GAN) framework, where it combines the high-level class loss and low-level feature loss to jointly train the adversarial example generator, and can efficiently generate image and video adversarial examples that have better transferability.
This paper proposes an attention-based approach that allows the model, Context R-CNN, to index into a long term memory bank constructed on a per-camera basis and aggregate contextual features from other frames to boost object detection performance on the current frame.
The effectiveness of HoughNet is validated in other visual detection tasks, namely, video object detection, instance segmentation, 3D object detection and keypoint detection for human pose estimation, and an additional "labels to photo" image generation task, where the integration of the voting module consistently improves performance in all cases.
A lightweight network architecture for video object detection on mobiles, with a very small network for establishing correspondence across frames, and a flow-guided GRU module designed to effectively aggregate features on key frames.
TransVOD is presented, the first end-to-end video object detection system based on simple yet effective spatial-temporal Transformer architectures that streamline the pipeline of current VOD, effectively removing the need for many hand-crafted components for feature aggregation.
This work presents flow-guided feature aggregation, an accurate and end-to-end learning framework for video object detection that improves the per-frame features by aggregation of nearby features along the motion paths, and thus improves the video recognition accuracy.
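The aggregation step in flow-guided approaches weights each nearby frame's (motion-aligned) features by their similarity to the reference frame and takes a normalized weighted sum. The sketch below shows only that adaptive-weighting step in NumPy, with the optical-flow warping deliberately omitted; the function name is illustrative:

```python
import numpy as np

def aggregate_features(ref_feat, nearby_feats):
    """Adaptively average nearby-frame features (FGFA-style sketch).

    ref_feat:     reference frame's feature map.
    nearby_feats: list of feature maps from neighboring frames, assumed
                  already warped into the reference frame's coordinates.
    Each neighbor is weighted by its cosine similarity to the reference,
    normalized with a softmax, then summed.
    """
    def cosine(a, b):
        a, b = a.ravel(), b.ravel()
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    sims = np.array([cosine(ref_feat, f) for f in nearby_feats])
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()
    return sum(w * f for w, f in zip(weights, nearby_feats))
```

Frames that are blurred or occluded produce features dissimilar to the reference and therefore receive small weights, which is what makes the aggregated feature more robust than any single frame.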
This paper addresses the analogous question of whether using memory in computer vision systems can not only improve the accuracy of object detection in video streams but also reduce computation time, by interleaving conventional feature extractors with extremely lightweight ones that only need to recognize the gist of the scene.
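The interleaving strategy described above can be sketched as a simple scheduling loop: an expensive extractor runs only on key frames, a cheap one runs in between, and a memory state fuses the two streams. The helper names (`heavy`, `light`, `fuse`) and the fixed-interval schedule below are illustrative placeholders, not the paper's actual components:

```python
def interleaved_detect(frames, heavy, light, fuse, interval=10):
    """Run a heavy feature extractor on key frames only.

    frames:   iterable of video frames.
    heavy:    expensive feature extractor, run every `interval` frames.
    light:    cheap extractor used on the intermediate frames.
    fuse:     function combining the running memory with new features.
    Returns the fused feature (memory state) for every frame.
    """
    memory = None
    outputs = []
    for i, frame in enumerate(frames):
        feat = heavy(frame) if i % interval == 0 else light(frame)
        memory = feat if memory is None else fuse(memory, feat)
        outputs.append(memory)
    return outputs
```

The compute saving comes from `heavy` running on only 1/`interval` of the frames, while the memory carries its richer features forward to the frames processed by `light`.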
Adding a benchmark result helps the community track progress.