CLAPSep: Leveraging Contrastive Pre-Trained Model for Multi-Modal Query-Conditioned Target Sound Extraction

Published in

IEEE/ACM Transactions on Audio Speech and Langu...(2024)

TL;DR

This paper tailors and adapts the contrastive language-audio pre-trained model (CLAP) for universal sound separation (USS); the resulting model is denoted CLAPSep. The authors report superior extraction performance, zero- and few-shot generalizability, and fast training convergence, surpassing previous methods by a significant margin.

Abstract

Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings. This can be achieved by language-queried target sound extraction (TSE), which typically consists of two components: a query network that converts user queries into conditional embeddings, and a separation network that extracts the target sound accordingly. Existing methods commonly train models from scratch. As a consequence, substantial data and computational resources are required to make the randomly initialized model comprehend sound events and perform separation accordingly. In this paper, we propose to integrate pre-trained models into TSE models to address the above issue. To be specific, we tailor and adapt the powerful contrastive language-audio pre-trained model (CLAP) for USS, denoted as CLAPSep. CLAPSep also accepts flexible user inputs, taking both positive and negative user prompts of uni- and/or multi-modalities for target sound extraction. These key features of CLAPSep can not only enhance the extraction performance but also improve the versatility of its application. We provide extensive experiments on 5 diverse datasets to demonstrate the superior performance and zero- and few-shot generalizability of our proposed CLAPSep with fast training convergence, surpassing previous methods by a significant margin. Full codes and some audio examples are released for reproduction and evaluation.
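The query side described in the abstract can be illustrated with a toy sketch. It is not the CLAPSep implementation: the function names, the averaging of same-polarity embeddings, the zero vector for a missing polarity, and the concatenation into one conditional embedding are all assumptions made for illustration. It assumes text and audio prompts have already been mapped into a shared embedding space by some pre-trained encoder such as CLAP.

```python
from typing import List, Sequence

Vector = List[float]


def mean_embedding(embeddings: Sequence[Vector], dim: int) -> Vector:
    """Average embeddings of one polarity (positive or negative).

    Assumption for this sketch: a missing polarity is represented
    by a zero vector of the same dimension.
    """
    if not embeddings:
        return [0.0] * dim
    for emb in embeddings:
        if len(emb) != dim:
            raise ValueError("all embeddings must have length dim")
    count = len(embeddings)
    return [sum(emb[i] for emb in embeddings) / count for i in range(dim)]


def build_condition(
    positive: Sequence[Vector], negative: Sequence[Vector], dim: int
) -> Vector:
    """Concatenate the positive and negative summaries into one
    conditional embedding for a (hypothetical) separation network."""
    return mean_embedding(positive, dim) + mean_embedding(negative, dim)


# Hypothetical 2-dimensional embeddings standing in for encoder outputs.
text_pos = [1.0, 0.0]   # e.g. a text prompt describing the target sound
audio_pos = [0.0, 1.0]  # e.g. an audio example of the target sound
condition = build_condition([text_pos, audio_pos], [], dim=2)
print(condition)  # [0.5, 0.5, 0.0, 0.0]
```

In this sketch a separation network would receive `condition` alongside the mixture. Keeping the positive and negative halves in fixed positions lets a single interface cover positive-only, negative-only, and combined queries.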

Authors

Hao Ma

Zhiyuan Peng

Xu Li

Mingjie Shao

References (59 items)

A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining

Listen, Chat, and Remix: Text-Guided Soundscape Remixing for Enhanced Auditory Experience

EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data

Consistent and Relevant: Rethink the Query Embedding in General Sound Separation

Extending Whisper with Prompt Tuning to Target-Speaker ASR

Audio Prompt Tuning for Universal Sound Separation

Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

Retrieval-Augmented Text-to-Audio Generation

Training Audio Captioning Models without Audio

AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining

Separate Anything You Describe

Universal Source Separation with Weakly Labelled Data

Target Sound Extraction with Variable Cross-Modality Clues

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

Real-Time Target Sound Extraction

Music Source Separation With Band-Split RNN

Text-Driven Separation of Arbitrary Sounds

SoundBeam: Target Sound Extraction Conditioned on Sound-Class Labels and Enrollment Clues for Increased Performance and Continuous Learning

Separate What You Describe: Language-Queried Audio Source Separation

HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

Image Segmentation Using Text and Image Prompts

DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Extract Free Dense Labels from CLIP

Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation

LoRA: Low-Rank Adaptation of Large Language Models

Learning Transferable Visual Models From Natural Language Supervision

Attention Is All You Need In Speech Separation

Dense CNN With Self-Attention for Time-Domain Speech Enhancement

Unsupervised Sound Separation Using Mixture Invariant Training

Listen to What You Want: Neural Network-based Universal Sound Selector

AudioCaps: Generating Captions for Audios in The Wild

Universal Sound Separation

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Improved Speech Enhancement with the Wave-U-Net

SDR – Half-baked or Well Done?

Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation

General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline

An Overview of Lead and Accompaniment Separation in Music

The Sound of Pixels

Decoupled Weight Decay Regularization

FiLM: Visual Reasoning with a General Conditioning Layer

Supervised Speech Separation Based on Deep Learning: An Overview

Attention is All you Need

Audio Set: An ontology and human-labeled dataset for audio events

Permutation invariant training of deep models for speaker-independent multi-talker speech separation

ESC: Dataset for Environmental Sound Classification

U-Net: Convolutional Networks for Biomedical Image Segmentation

Freesound technical demo

Zero-shot Audio Source Separation via Query-based Learning from Weakly-labeled Data

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Visualizing Data using t-SNE

Field of Study

Computer Science, Engineering

Journal Information

Name

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Page

1702-1726

Volume

Venue Information

Name

IEEE/ACM Transactions on Audio Speech and Language Processing

Type

URL

https://ieeexplore.ieee.org/servlet/opac?punumber=6570655

Alternate Names

IEEE/ACM Trans Audio Speech Lang Process
