This task is a variant of Target Sound Extraction with the added constraint of causal streaming inference. To keep algorithmic latency below 20 ms, a streaming audio model operates at each time step on an input audio chunk shorter than 20 ms. The causal constraint means the model has access only to the current and past chunks, never to future ones.
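The chunked, causal processing described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 16 kHz sample rate, the 10 ms chunk duration, and the helper names (`chunk_length`, `running_mean_step`, `process_stream`) are hypothetical, and the "model" is a toy running-mean subtraction rather than an actual sound extraction network.

```python
SAMPLE_RATE_HZ = 16_000  # assumed sample rate, not taken from the source
CHUNK_MS = 10            # assumed chunk duration, chosen to stay under 20 ms


def chunk_length(sample_rate_hz, chunk_ms):
    """Number of samples in one chunk of the given duration."""
    return int(sample_rate_hz * chunk_ms / 1000)


def running_mean_step(chunk, state):
    """Toy stand-in for a streaming model (hypothetical).

    Subtracts the mean of all audio seen so far, including this chunk.
    The state carries only information from past and current chunks.
    """
    total, count = state
    total += sum(chunk)
    count += len(chunk)
    mean = total / count
    return [x - mean for x in chunk], (total, count)


def process_stream(signal, chunk_size, step, state):
    """Feed the signal to `step` one chunk at a time, in order.

    Each call sees only the current chunk plus the carried state,
    so the output for a chunk cannot depend on future chunks.
    """
    out = []
    for start in range(0, len(signal), chunk_size):
        chunk = signal[start:start + chunk_size]
        processed, state = step(chunk, state)
        out.extend(processed)
    return out


if __name__ == "__main__":
    size = chunk_length(SAMPLE_RATE_HZ, CHUNK_MS)
    signal = [float(i % 7) for i in range(4 * size)]
    output = process_stream(signal, size, running_mean_step, (0.0, 0))
    print(size, len(output))
```

Because the state is updated strictly in arrival order, changing samples in a later chunk leaves the outputs of all earlier chunks unchanged, which is the causal property the task requires.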
Waveformer is an encoder-decoder architecture that uses a stack of dilated causal convolution layers as the encoder and a transformer decoder layer as the decoder. It is presented as the first neural network model to achieve real-time, streaming target sound extraction.
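To make the idea of a dilated causal convolution stack concrete, here is a minimal pure-Python sketch. It is not Waveformer's implementation: the kernel values, the dilation schedule (1, 2, 4), and the function names are assumptions chosen for illustration only.

```python
def causal_dilated_conv1d(x, kernel, dilation):
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Samples before the start of the signal are treated as zero,
    so y[t] never depends on x[t + 1] or anything later.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Past samples (including the current one) seen by a stack of layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    # An impulse at t = 0 shows how far into the future each input sample
    # can influence the output, which equals the stack's receptive field.
    signal = [1.0] + [0.0] * 9
    dilations = [1, 2, 4]  # assumed schedule, doubling per layer
    for d in dilations:
        signal = causal_dilated_conv1d(signal, [1.0, 1.0], d)
    print(signal)
    print(receptive_field(2, dilations))
```

Doubling the dilation at each layer grows the receptive field exponentially with depth while keeping every layer causal, which is why such stacks suit streaming encoders that must look far into the past without seeing the future.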