1. UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
2. InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges
3. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
4. NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition
5. TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
6. X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
7. 1st Place Solutions for RxR-Habitat Vision-and-Language Navigation Competition (CVPR 2022)
8. LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
9. Masked Autoencoders As Spatiotemporal Learners
10. CoCa: Contrastive Captioners are Image-Text Foundation Models
11. Flamingo: a Visual Language Model for Few-Shot Learning
12. Unified Contrastive Learning in Image-Text-Label Space
13. MultiMAE: Multi-modal Multi-task Masked Autoencoders
14. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
15. Pathways: Asynchronous Distributed Dataflow for ML
16. All in One: Exploring Unified Video-Language Pre-Training
17. Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation
18. ActionFormer: Localizing Moments of Actions with Transformers
19. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
20. UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
21. Multiview Transformers for Video Recognition
22. MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound
23. Masked Feature Prediction for Self-Supervised Visual Pre-Training
24. BEVT: BERT Pretraining of Video Transformers
25. Scaling Up Vision-Language Pretraining for Image Captioning
26. VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
27. Florence: A New Foundation Model for Computer Vision
28. INTERN: A New Learning Paradigm Towards General Vision
29. Masked Autoencoders Are Scalable Vision Learners
30. FILIP: Fine-grained Interactive Language-Image Pre-Training
31. An Empirical Study of Training End-to-End Vision-and-Language Transformers
32. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
33. History Aware Multimodal Transformer for Vision-and-Language Navigation
34. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
35. ActionCLIP: A New Paradigm for Video Action Recognition
36. Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss
37. Robust fine-tuning of zero-shot models
38. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
39. On the Opportunities and Risks of Foundation Models
40. EAN: Event Adaptive Network for Enhanced Action Recognition
41. Evidential Deep Learning for Open Set Action Recognition
42. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
43. How Much Can CLIP Benefit Vision-and-Language Tasks?
45. BEiT: BERT Pre-Training of Image Transformers
46. Relation Modeling in Spatio-Temporal Action Localization
47. MERLOT: Multimodal Neural Script Knowledge Models
48. FineAction: A Fine-Grained Video Dataset for Temporal Action Localization
49. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
50. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
51. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
52. ViViT: A Video Vision Transformer
53. Temporal Context Aggregation Network for Temporal Action Proposal Refinement
54. Learning Transferable Visual Models From Natural Language Supervision
55. Zero-Shot Text-to-Image Generation
56. Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling
57. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
58. TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks
59. Exploring Simple Siamese Representation Learning
60. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
61. Learning Open Set Network with Discriminative Reciprocal Points
62. Generative Pretraining From Pixels
63. Learn to cycle: Time-consistent feature discovery for action recognition
64. Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization
65. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
66. ActBERT: Learning Global-Local Video-Text Representations
67. Augment Your Batch: Improving Generalization Through Instance Repetition
68. The AVA-Kinetics Localized Human Actions Video Dataset
69. Asynchronous Interaction Aggregation for Action Detection
70. Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
71. A Simple Framework for Contrastive Learning of Visual Representations
72. Learning Spatiotemporal Features via Video and Text Pair Discrimination
73. Self-supervising Action Recognition by Statistical Moment and Subspace Descriptors
74. End-to-End Learning of Visual Representations From Uncurated Instructional Videos
75. Momentum Contrast for Unsupervised Visual Representation Learning
76. Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding
77. BMN: Boundary-Matching Network for Temporal Action Proposal Generation
78. A Short Note on the Kinetics-700 Human Action Dataset
79. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
80. VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
81. VideoBERT: A Joint Model for Video and Language Representation Learning
82. SlowFast Networks for Video Recognition
83. BAR: Bayesian Activity Recognition using variational inference
84. A Joint Sequence Fusion Model for Video Question Answering and Retrieval
85. A Short Note about Kinetics-600
86. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation
87. Unsupervised Feature Learning via Non-parametric Instance Discrimination
88. HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization
89. Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
90. Localizing Moments in Video with Natural Language
91. The "Something Something" Video Database for Learning and Evaluating Visual Common Sense
92. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
93. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
95. Deep Learning for Video Classification and Captioning
96. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
97. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
98. TGIF: A New Dataset and Benchmark on Animated GIF Description
99. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
100. Colorful Image Colorization
101. Towards Open Set Deep Networks
102. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding
103. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
104. Unsupervised Visual Representation Learning by Context Prediction
105. A Dataset for Movie Description
106. Microsoft COCO: Common Objects in Context
107. HMDB: A Large Video Database for Human Motion Recognition
108. Collecting Highly Parallel Data for Paraphrase Evaluation
109. Unsupervised Learning of Visual Representations using Videos
110. Computer Vision and Image Understanding