Checklist

Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] See the abstract and the contributions in the Introduction (Section 1).
Did you discuss any potential negative societal impacts of your work?
Have you read the ethics review guidelines and ensured that your paper conforms to them?
Did you state the full set of assumptions of all theoretical results?
Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] For data splits and hyperparameters, see the implementation details (Section 3.2.1) and baselines.
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] Transformer-based models are computationally expensive and take a long time to train the whole network.
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets:
Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?
If you used crowdsourcing or conducted research with human subjects:
Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?