3260 papers • 126 benchmarks • 313 datasets
Video object detection is the task of detecting objects in a video rather than in still images. (Image credit: Learning Motion Priors for Efficient Video Object Detection)
These leaderboards are used to track progress in video object detection.
Use these libraries to find video object detection models and implementations.
No subtasks available.
This paper asks whether self-supervised learning gives Vision Transformers (ViTs) properties that stand out compared to convolutional networks (convnets), and introduces DINO, a form of self-distillation with no labels that highlights the synergy between this training method and ViTs.
A generic and effective Temporal Shift Module (TSM) that can achieve the performance of 3D CNN but maintain 2D CNN’s complexity and is extended to online setting, which enables real-time low-latency online video recognition and video object detection.
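The core idea of the Temporal Shift Module can be illustrated in a few lines: a fraction of the channels is shifted forward and backward along the time axis, so a plain 2D CNN sees information from neighboring frames at zero extra FLOPs. Below is a minimal NumPy sketch of that shifting step (the function name and `shift_div` parameter are illustrative, not the paper's actual API):

```python
import numpy as np

def temporal_shift(features, shift_div=8):
    """Shift a fraction of channels along the temporal dimension.

    features: array of shape (T, C, H, W) -- per-frame feature maps.
    1/shift_div of the channels is shifted toward the past, another
    1/shift_div toward the future; vacated positions are zero-padded.
    """
    T, C, H, W = features.shape
    fold = C // shift_div
    out = np.zeros_like(features)
    # First fold: bring in the next frame's channels (shift left).
    out[:-1, :fold] = features[1:, :fold]
    # Second fold: bring in the previous frame's channels (shift right).
    out[1:, fold:2 * fold] = features[:-1, fold:2 * fold]
    # Remaining channels are left untouched.
    out[:, 2 * fold:] = features[:, 2 * fold:]
    return out
```

In the full model, a shift like this is inserted inside residual blocks, so the subsequent 2D convolutions mix temporal information without any 3D kernels.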
This approach combines fast single-image object detection with convolutional long short term memory layers to create an inter-weaved recurrent-convolutional architecture that is substantially faster than existing detection methods in video and significantly reduces computational cost.
The proposed method is based on the Generative Adversarial Network (GAN) framework, where it combines the high-level class loss and low-level feature loss to jointly train the adversarial example generator, and can efficiently generate image and video adversarial examples that have better transferability.
This paper proposes an attention-based approach that allows the model, Context R-CNN, to index into a long term memory bank constructed on a per-camera basis and aggregate contextual features from other frames to boost object detection performance on the current frame.
The effectiveness of HoughNet is validated in other visual detection tasks, namely, video object detection, instance segmentation, 3D object detection and keypoint detection for human pose estimation, and an additional "labels to photo" image generation task, where the integration of the voting module consistently improves performance in all cases.
A lightweight network architecture for video object detection on mobiles, with a very small network for establishing correspondence across frames, and a flow-guided GRU module designed to effectively aggregate features on key frames.
TransVOD is presented, the first end-to-end video object detection system based on simple yet effective spatial-temporal Transformer architectures that streamline the pipeline of current VOD, effectively removing the need for many hand-crafted components for feature aggregation.
This work presents flow-guided feature aggregation, an accurate and end-to-end learning framework for video object detection that improves the per-frame features by aggregation of nearby features along the motion paths, and thus improves the video recognition accuracy.
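The aggregation step in flow-guided approaches weights each nearby frame's (motion-aligned) features by their similarity to the reference frame and takes a normalized weighted sum. The sketch below shows only that adaptive-weighting step in NumPy, with the optical-flow warping deliberately omitted; the function name is illustrative:

```python
import numpy as np

def aggregate_features(ref_feat, nearby_feats):
    """Adaptively average nearby-frame features (FGFA-style sketch).

    ref_feat:     reference frame's feature map.
    nearby_feats: list of feature maps from neighboring frames, assumed
                  already warped into the reference frame's coordinates.
    Each neighbor is weighted by its cosine similarity to the reference,
    normalized with a softmax, then summed.
    """
    def cosine(a, b):
        a, b = a.ravel(), b.ravel()
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    sims = np.array([cosine(ref_feat, f) for f in nearby_feats])
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()
    return sum(w * f for w, f in zip(weights, nearby_feats))
```

Frames that are blurred or occluded produce features dissimilar to the reference and therefore receive small weights, which is what makes the aggregated feature more robust than any single frame.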
This paper addresses the analogous question of whether using memory in computer vision systems can not only improve the accuracy of object detection in video streams but also reduce computation time, by interleaving conventional feature extractors with extremely lightweight ones that only need to recognize the gist of the scene.
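The interleaving strategy described above can be sketched as a simple scheduling loop: an expensive extractor runs only on key frames, a cheap one runs in between, and a memory state fuses the two streams. The helper names (`heavy`, `light`, `fuse`) and the fixed-interval schedule below are illustrative placeholders, not the paper's actual components:

```python
def interleaved_detect(frames, heavy, light, fuse, interval=10):
    """Run a heavy feature extractor on key frames only.

    frames:   iterable of video frames.
    heavy:    expensive feature extractor, run every `interval` frames.
    light:    cheap extractor used on the intermediate frames.
    fuse:     function combining the running memory with new features.
    Returns the fused feature (memory state) for every frame.
    """
    memory = None
    outputs = []
    for i, frame in enumerate(frames):
        feat = heavy(frame) if i % interval == 0 else light(frame)
        memory = feat if memory is None else fuse(memory, feat)
        outputs.append(memory)
    return outputs
```

The compute saving comes from `heavy` running on only 1/`interval` of the frames, while the memory carries its richer features forward to the frames processed by `light`.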
Adding a benchmark result helps the community track progress.