Egocentric Video-Language Pretraining (2022-06-03)

TL;DR

This work uses the recently released Ego4D dataset to pioneer Egocentric VLP along three directions, and proposes a novel pretraining objective, dubbed EgoNCE, which adapts video-text contrastive learning to the egocentric domain by mining egocentric-aware positive and negative samples.

Abstract

Video-Language Pretraining (VLP), which aims to learn transferable representations that advance a wide range of video-text downstream tasks, has recently received increasing attention. The best-performing works rely on large-scale, 3rd-person video-text datasets, such as HowTo100M. In this work, we exploit the recently released Ego4D dataset to pioneer Egocentric VLP along three directions. (i) We create EgoClip, a 1st-person video-text pretraining dataset comprising 3.8M clip-text pairs carefully selected from Ego4D, covering a large variety of human daily activities. (ii) We propose a novel pretraining objective, dubbed EgoNCE, which adapts video-text contrastive learning to the egocentric domain by mining egocentric-aware positive and negative samples. (iii) We introduce EgoMCQ, a development benchmark that is close to EgoClip and hence can support effective validation and fast exploration of our design decisions in EgoClip and EgoNCE. Furthermore, we demonstrate strong performance on five egocentric downstream tasks across three datasets: video-text retrieval on EPIC-KITCHENS-100; action recognition on Charades-Ego; and natural language query, moment query, and object state change classification on the Ego4D challenge benchmarks. The dataset and code are available at https://github.com/showlab/EgoVLP.
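As a rough illustration of point (ii), the sketch below shows a video-to-text contrastive loss in which a clip may have several positive texts rather than only its own caption. It is a minimal sketch under stated assumptions, not the released EgoNCE implementation: the abstract does not spell out the mining rule, so the rule used here (two samples are positives if their hypothetical label sets overlap) and the names `build_positive_mask`, `masked_info_nce`, and `temperature` are illustrative only.

```python
import math


def build_positive_mask(labels):
    """Mark text j as a positive for video i if they share a label.

    ASSUMPTION: `labels` is a hypothetical list of sets (e.g. action words),
    one per clip-text pair. The actual mining rule is not given in the abstract.
    """
    n = len(labels)
    return [[i == j or bool(labels[i] & labels[j]) for j in range(n)]
            for i in range(n)]


def masked_info_nce(sim, pos_mask, temperature=0.05):
    """Contrastive loss where each row may have more than one positive.

    sim[i][j]      : similarity between video i and text j (e.g. cosine).
    pos_mask[i][j] : True if text j counts as a positive for video i.
    """
    losses = []
    for row, mask in zip(sim, pos_mask):
        logits = [s / temperature for s in row]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - top) for l in logits]
        pos = sum(e for e, is_pos in zip(exps, mask) if is_pos)
        losses.append(-math.log(pos / sum(exps)))
    return sum(losses) / len(losses)


# Toy batch: the first two clips share the hypothetical label "cut".
sim = [[0.9, 0.8, 0.1],
       [0.7, 0.9, 0.2],
       [0.1, 0.0, 0.95]]
labels = [{"cut", "onion"}, {"cut", "carrot"}, {"open", "door"}]
loss = masked_info_nce(sim, build_positive_mask(labels))
```

With a diagonal-only mask this reduces to the usual InfoNCE form; widening the mask keeps semantically similar pairs from being pushed apart as if they were negatives.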

Authors

Bernard Ghanem

Alex Wang

D. Damen

In Fig. 8, we visualize some clip-text pairs created by our strategy.

Figure 8: Visualization of EgoClip clip-text pairs. We sample five frames uniformly from each clip and take its narration as its caption. Example narrations: "#C C draws on a book", "#C C picks the chopsticks", "#C C ties the vegetable with a band", "#C C stretches his left hand".
