Target Sound Extraction is the task of extracting the sound corresponding to a given class from an audio mixture. The mixture may also contain background noise with a relatively low amplitude compared to the foreground components. The target sound class is provided as input to the model in the form of a string, an integer index, or a one-hot encoding of the class.
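The sketch below illustrates how such a class label can be turned into a conditioning vector; the class names, label set, and embedding choice are illustrative assumptions, not tied to any specific paper.

```python
# Minimal sketch: map a class string to an integer index and a one-hot vector
# that can condition a target sound extraction network. The label set below
# is an assumption for illustration.
import torch
import torch.nn.functional as F

SOUND_CLASSES = ["dog_bark", "siren", "speech", "bird_chirp"]  # assumed label set

def encode_target_class(class_name: str) -> torch.Tensor:
    """String -> integer index -> one-hot conditioning vector."""
    index = SOUND_CLASSES.index(class_name)
    return F.one_hot(torch.tensor(index), num_classes=len(SOUND_CLASSES)).float()

# The resulting vector is typically projected or embedded and injected into
# the extraction network to select which source to keep.
condition = encode_target_class("siren")   # tensor([0., 1., 0., 0.])
```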
These leaderboards are used to track progress in Target Sound Extraction.
Use these libraries to find Target Sound Extraction models and implementations.
Waveformer is presented, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder and a transformer decoder layer as the decoder; it is the first neural network model to achieve real-time, streaming target sound extraction.
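A minimal sketch of this kind of layout, assuming a stack of dilated causal 1-D convolutions as the encoder and a single transformer decoder layer queried with a learned class embedding; layer sizes and the conditioning scheme below are illustrative assumptions, not Waveformer's exact configuration.

```python
# Sketch of a dilated-causal-conv encoder plus transformer-decoder extractor.
import torch
import torch.nn as nn

class TSESketch(nn.Module):
    def __init__(self, channels=64, num_classes=20, num_enc_layers=4):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, kernel_size=1)
        # Dilated convolutions with exponentially growing dilation; made causal
        # by left-padding so no future samples are used.
        self.encoder = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, dilation=2 ** i)
            for i in range(num_enc_layers)
        ])
        self.label_embed = nn.Embedding(num_classes, channels)
        self.decoder = nn.TransformerDecoderLayer(
            d_model=channels, nhead=4, batch_first=True)
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, mixture, class_id):
        # mixture: (batch, 1, time); class_id: (batch,)
        x = self.inp(mixture)
        for i, conv in enumerate(self.encoder):
            pad = (2 ** i) * 2                               # causal left padding
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0)))) + x
        query = self.label_embed(class_id).unsqueeze(1)      # (batch, 1, C)
        memory = x.transpose(1, 2)                           # (batch, time, C)
        feats = self.decoder(memory, memory + query)         # label-conditioned
        return self.out(feats.transpose(1, 2))               # extracted waveform

mixture = torch.randn(2, 1, 2000)                    # short excerpt for the example
estimate = TSESketch()(mixture, torch.tensor([3, 7]))   # -> (2, 1, 2000)
```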
Experimental results on seen and unseen target-sound evaluation sets show that the proposed TSE model can effectively handle a varying number of clues, which improves TSE performance and robustness against partially compromised clues.
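One common way to handle a variable number of clues is to fuse however many clue embeddings are available into a single fixed-size conditioning vector; the sketch below assumes simple averaging and hypothetical clue embeddings, not the paper's specific fusion mechanism.

```python
# Sketch: fuse a variable number of clue embeddings (class label, enrollment
# audio, video, ...) into one fixed-size condition by averaging.
import torch

def fuse_clues(clues: list) -> torch.Tensor:
    """clues: non-empty list of (embed_dim,) tensors -> single (embed_dim,) vector."""
    return torch.stack(clues, dim=0).mean(dim=0)

class_clue = torch.randn(128)     # e.g. projected one-hot class label (assumed size)
enroll_clue = torch.randn(128)    # e.g. embedding of an enrollment recording
condition = fuse_clues([class_clue, enroll_clue])   # works with 1, 2, or more clues
```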
DPM-TSE is introduced, a generative method based on diffusion probabilistic modeling (DPM) for target sound extraction (TSE), to achieve both cleaner target renderings and improved separability from unwanted sounds.
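As a rough illustration of how a diffusion-based extractor can be sampled, the sketch below runs a standard DDPM reverse chain with a toy denoiser conditioned on the mixture and a target-class embedding; the network, noise schedule, and step count are assumptions for illustration, not the DPM-TSE design.

```python
# Sketch: conditional DDPM sampling for target sound extraction.
import torch
import torch.nn as nn

T = 50                                          # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

class Denoiser(nn.Module):
    """Predicts the noise in x_t given the mixture and the target-class id."""
    def __init__(self, channels=32, num_classes=20):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, channels)
        self.inp = nn.Conv1d(2, channels, 5, padding=2)      # x_t and mixture
        self.out = nn.Conv1d(channels, 1, 5, padding=2)

    def forward(self, x_t, mixture, class_id, t):
        # t is unused here; real denoisers also condition on the timestep.
        h = torch.relu(self.inp(torch.cat([x_t, mixture], dim=1)))
        h = h + self.class_embed(class_id).unsqueeze(-1)     # class conditioning
        return self.out(h)

@torch.no_grad()
def sample(model, mixture, class_id):
    x = torch.randn_like(mixture)               # start from Gaussian noise
    for t in reversed(range(T)):
        eps = model(x, mixture, class_id, t)
        # Standard DDPM posterior mean update
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

estimate = sample(Denoiser(), torch.randn(1, 1, 8000), torch.tensor([3]))
```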
Imagine being able to listen to the birds chirping in a park without hearing the chatter from other hikers, or being able to block out traffic noise on a busy street while still being able to hear emergency sirens and car honks. We introduce semantic hearing, a novel capability for hearable devices that enables them to, in real-time, focus on, or ignore, specific sounds from real-world environments, while also preserving the spatial cues. To achieve this, we make two technical contributions: 1) we present the first neural network that can achieve binaural target sound extraction in the presence of interfering sounds and background noise, and 2) we design a training methodology that allows our system to generalize to real-world use. Results show that our system can operate with 20 sound classes and that our transformer-based network has a runtime of 6.56 ms on a connected smartphone. In-the-wild evaluation with participants in previously unseen indoor and outdoor scenarios shows that our proof-of-concept system can extract the target sounds and generalize to preserve the spatial cues in its binaural output. Project page with code: https://semantichearing.cs.washington.edu
This paper tailors and adapts the powerful contrastive language-audio pre-trained model (CLAP) for universal sound separation (USS), denoted CLAPSep, and shows superior performance and zero- and few-shot generalizability of the proposed CLAPSep with fast training convergence, surpassing previous methods by a significant margin.
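A minimal sketch of text-query conditioning in this spirit: a fixed-size embedding, as a CLAP-style text encoder would produce, modulates the separator's features via feature-wise linear modulation (FiLM). The encoder stand-in, embedding size, and network below are assumptions, not CLAPSep's architecture.

```python
# Sketch: condition a simple masking separator on a text-query embedding.
import torch
import torch.nn as nn

EMBED_DIM = 512   # typical CLAP embedding size; assumed here

class FiLMSeparator(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.encode = nn.Conv1d(1, channels, kernel_size=16, stride=8)
        self.film = nn.Linear(EMBED_DIM, 2 * channels)       # -> scale and shift
        self.mask = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.decode = nn.ConvTranspose1d(channels, 1, kernel_size=16, stride=8)

    def forward(self, mixture, text_embedding):
        feats = torch.relu(self.encode(mixture))              # (B, C, frames)
        scale, shift = self.film(text_embedding).chunk(2, dim=-1)
        feats = feats * scale.unsqueeze(-1) + shift.unsqueeze(-1)
        mask = torch.sigmoid(self.mask(feats))                # soft mask
        return self.decode(feats * mask)                      # estimated target

# Placeholder embedding; in practice this would come from a frozen CLAP text
# encoder given a prompt such as "the sound of a siren".
mixture = torch.randn(1, 1, 16000)
text_embedding = torch.randn(1, EMBED_DIM)
estimate = FiLMSeparator()(mixture, text_embedding)
```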