In this report, we present our champion solutions to five tracks of the Ego4D challenge. We leverage our developed video foundation model, InternVideo, for five Ego4D tasks: Moment Queries, Natural Language Queries, Future Hand Prediction, State Change Object Detection, and Short-term Object Interaction Anticipation. InternVideo-Ego4D is an effective paradigm for adapting a strong foundation model to downstream egocentric video understanding tasks with simple head designs. On all five tasks, InternVideo-Ego4D comprehensively surpasses both the baseline methods and the CVPR 2022 challenge champions, demonstrating the powerful representation ability of InternVideo as a video foundation model. Our code will be released at https://github.com/OpenGVLab/ego4d-eccv2022-solutions.
Jiahao Wang, Yu Qiao, Junting Pan, Kunchang Li, Yinan He, Zun Wang, Guo Chen, Hongjie Zhang, Bingkun Huang, Zhiyu Zhao, Yi Liu, Sen Xing, Yifei Huang, Yi Wang, Yizhuo Li, Yin-Dong Zheng, Tong Lu, Limin Wang