Checklist

Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] See the abstract and the contributions in the Introduction (Section 1).
Did you discuss any potential negative societal impacts of your work?
Have you read the ethics review guidelines and ensured that your paper conforms to them?
Did you state the full set of assumptions of all theoretical results?
Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] For data splits and hyperparameters, see the implementation details (Section 3.2.1) and baselines.
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] Transformer-based models are computationally expensive and take a long time to train the whole network.
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets:
Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?
If you used crowdsourcing or conducted research with human subjects:
Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?