3260 papers • 126 benchmarks • 313 datasets
Sound Event Detection (SED) is the task of recognizing sound events and their respective temporal start and end times in a recording. Sound events in real life do not always occur in isolation; they often overlap considerably with each other. Recognizing such overlapping sound events is referred to as polyphonic SED. Source: A report on sound event detection with different binaural features
(Image credit: Papersgraph)
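To make the task definition concrete, here is a minimal sketch of how frame-wise SED output is typically turned into event lists with start and end times; the decision threshold and hop size are illustrative assumptions, not values from any particular system.

```python
# Minimal sketch: decode frame-wise multi-label probabilities into
# (event, onset, offset) tuples. Threshold and hop size are assumed values.
import numpy as np

def decode_events(probs, class_names, threshold=0.5, hop_s=0.02):
    """probs: (num_frames, num_classes) array of per-frame class probabilities."""
    events = []
    active = probs >= threshold                      # (frames, classes) bool mask
    for c, name in enumerate(class_names):
        col = active[:, c]
        # Find rising/falling edges of the binary activity track.
        edges = np.diff(col.astype(int), prepend=0, append=0)
        onsets = np.where(edges == 1)[0]
        offsets = np.where(edges == -1)[0]
        for on, off in zip(onsets, offsets):
            events.append((name, on * hop_s, off * hop_s))
    return events

# Overlapping ("polyphonic") events are natural here: several classes
# can be active in the same frame.
probs = np.zeros((10, 2)); probs[2:7, 0] = 0.9; probs[4:9, 1] = 0.8
print(decode_events(probs, ["speech", "dog_bark"]))
```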
These leaderboards are used to track progress in Sound Event Detection
Use these libraries to find Sound Event Detection models and implementations
No subtasks available.
This work studies the adversarial robustness of neural networks through the lens of robust optimization, and suggests the notion of security against a first-order adversary as a natural and broad security guarantee.
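As a rough illustration of the first-order adversary in question, here is a minimal projected gradient descent (PGD) sketch in PyTorch; the perturbation budget, step size, and step count are illustrative assumptions.

```python
# Minimal PGD sketch: maximize the loss within an L-infinity ball via
# projected gradient ascent. Hyperparameters (eps, alpha, steps) are
# illustrative assumptions, not canonical values.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)  # random start
    x_adv = x_adv.clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend along the gradient sign, then project back into the eps-ball.
        x_adv = (x_adv + alpha * grad.sign()).detach()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv
```

Robust (adversarial) training then amounts to replacing clean batches with the output of such an attack during optimization.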
The parameterization of hypercomplex convolutional layers is defined, and the family of parameterized hypercomplex neural networks (PHNNs), lightweight and efficient large-scale models, is introduced.
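A minimal sketch of the parameterized hypercomplex multiplication (PHM) idea underlying PHNNs, shown here for a linear layer for simplicity: the weight matrix is a learned sum of Kronecker products, which cuts parameters roughly by a factor of n. Shapes and initialization are illustrative assumptions.

```python
# Sketch of a PHM layer: W = sum_i kron(A_i, B_i), reducing parameters
# roughly by a factor of n relative to a dense layer.
import torch
import torch.nn as nn

class PHMLinear(nn.Module):
    def __init__(self, n, in_features, out_features):
        super().__init__()
        assert in_features % n == 0 and out_features % n == 0
        # n small "algebra rule" matrices, learned from data rather than
        # fixed in advance as in quaternion networks.
        self.A = nn.Parameter(torch.randn(n, n, n))
        self.B = nn.Parameter(torch.randn(n, out_features // n, in_features // n))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Build W of shape (out_features, in_features) from Kronecker products.
        W = sum(torch.kron(self.A[i], self.B[i]) for i in range(self.A.shape[0]))
        return x @ W.T + self.bias

layer = PHMLinear(n=4, in_features=64, out_features=128)
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 128])
```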
This work introduces WavCaps, the first large-scale weakly-labelled audio captioning dataset, and proposes a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
This paper presents an approach to polyphonic sound event detection in real life recordings based on bi-directional long short term memory (BLSTM) recurrent neural networks (RNNs).
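A minimal sketch of such a BLSTM tagger follows; the layer sizes and feature dimensions are assumptions, not the paper's exact configuration. The key point is one independent sigmoid per class and frame, so overlapping events can be active simultaneously.

```python
# Sketch of a BLSTM frame-level tagger for polyphonic SED.
# Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class BLSTMSed(nn.Module):
    def __init__(self, num_features=40, hidden=128, num_classes=10):
        super().__init__()
        self.blstm = nn.LSTM(num_features, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                    # x: (batch, frames, num_features)
        h, _ = self.blstm(x)
        # Independent sigmoids allow several classes per frame (polyphony).
        return torch.sigmoid(self.head(h))   # (batch, frames, num_classes)

model = BLSTMSed()
probs = model(torch.randn(4, 500, 40))  # e.g. 500 frames of log-mel features
# Trained with frame-wise binary cross-entropy against multi-label targets.
```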
This paper treats SED as a multiple instance learning (MIL) problem, where training labels are static over a short excerpt, indicating the presence or absence of sound sources but not their temporal locality, and develops a family of adaptive pooling operators, referred to as autopool, which smoothly interpolate between common pooling operators and automatically adapt to the characteristics of the sound sources in question.
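A sketch of the autopool operator as described: a learnable scalar per class interpolates between mean pooling (alpha = 0) and max pooling (alpha approaching infinity) when aggregating frame-level predictions into a clip-level MIL prediction.

```python
# Sketch of auto-pooling: a learnable per-class alpha interpolates between
# mean pooling (alpha = 0) and max pooling (alpha -> infinity).
import torch
import torch.nn as nn

class AutoPool(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(num_classes))  # start at mean pooling

    def forward(self, p):                # p: (batch, frames, classes) in [0, 1]
        w = torch.softmax(self.alpha * p, dim=1)   # frame weights per class
        return (p * w).sum(dim=1)                  # clip-level (batch, classes)
```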
Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully labeled data, and it is shown that noise-robust loss functions can be effective in improving performance in the presence of corrupted labels.
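As one concrete example of a noise-robust loss (an illustration of the idea, not necessarily the exact loss used in the paper), the generalized cross-entropy (L_q) loss interpolates between cross-entropy (q near 0) and mean absolute error (q = 1):

```python
# Sketch of the generalized cross-entropy (L_q) loss: limits the gradient
# contribution from examples the model finds implausible, which are often
# the mislabeled ones. q = 0.7 is a commonly cited default, assumed here.
import torch
import torch.nn.functional as F

def lq_loss(logits, targets, q=0.7):
    """logits: (batch, classes); targets: (batch,) integer labels."""
    p = F.softmax(logits, dim=1)
    p_true = p.gather(1, targets.unsqueeze(1)).squeeze(1)  # prob of labeled class
    return ((1.0 - p_true.clamp_min(1e-7) ** q) / q).mean()
```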
The proposed framework (SELD-TCN) outperforms the state-of-the-art SELDnet performance on four different datasets and achieves 4x faster training time per epoch and 40x faster inference time on an ordinary graphics processing unit (GPU).
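A sketch of the kind of dilated-convolution block a TCN substitutes for recurrent layers; channel counts and kernel size are assumptions. Because the stack is fully convolutional, training and inference parallelize across time, which is where the reported speedups come from.

```python
# Sketch of a residual TCN block: stacking blocks with exponentially
# growing dilation gives a large temporal receptive field without recurrence.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation      # "same" padding
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x):                 # x: (batch, channels, frames)
        return x + torch.relu(self.norm(self.conv(x)))  # residual connection

tcn = nn.Sequential(*[TCNBlock(128, dilation=2 ** i) for i in range(4)])
print(tcn(torch.randn(2, 128, 600)).shape)  # torch.Size([2, 128, 600])
```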
In experimental evaluations with the DCASE 2020 Task 3 dataset, the ACCDOA representation outperformed the two-branch representation in SELD metrics with a smaller network size and performed better than state-of-the-art SELD systems in terms of localization and location-dependent detection.
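A sketch of the ACCDOA encoding: each class gets a single 3-D Cartesian vector whose direction is the DOA and whose length is the event activity, so detection and localization share one output instead of two branches. The decode threshold of 0.5 is an assumed value.

```python
# Sketch of activity-coupled Cartesian DOA (ACCDOA) encoding/decoding:
# activity is folded into the magnitude of the per-class DOA vector.
import numpy as np

def accdoa_encode(activity, doa_unit):
    """activity: (frames, classes) in [0, 1]; doa_unit: (frames, classes, 3) unit vectors."""
    return activity[..., None] * doa_unit          # (frames, classes, 3)

def accdoa_decode(vec, threshold=0.5):
    norm = np.linalg.norm(vec, axis=-1)            # activity = vector length
    active = norm > threshold                      # location-dependent detection
    doa = vec / np.maximum(norm[..., None], 1e-9)  # direction = normalized vector
    return active, doa
```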
An effective Couple Learning method is proposed that combines a well-trained model with a Mean Teacher model, increasing the amount of strongly and weakly labeled data and reducing the impact of noise in the pseudo-labels introduced by detection errors.
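A sketch of the Mean Teacher half of the scheme: the teacher is an exponential moving average (EMA) of the student and supplies pseudo-labels on unlabeled clips; the decay value is an assumption.

```python
# Sketch of the Mean Teacher EMA update: the teacher's weights track a
# slow moving average of the student's. decay = 0.999 is an assumed value.
import torch

@torch.no_grad()
def update_teacher(teacher, student, decay=0.999):
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)
# The student is additionally trained to be consistent with the teacher's
# predictions (pseudo-labels) on unlabeled or weakly labeled clips.
```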
A random consistency training (RCT) strategy is proposed and fused with the teacher-student model to stabilize training, and a hard mixup data augmentation is proposed to account for the additive property of sounds.
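A sketch of hard mixup exploiting the additive property of sounds: two waveforms are summed outright (rather than convexly interpolated) and the target becomes the union of both label sets, so the mix stays a physically plausible polyphonic recording.

```python
# Sketch of "hard" mixup for audio: sounds superpose additively, so the
# mixed clip simply contains both label sets.
import torch

def hard_mixup(wav_a, wav_b, labels_a, labels_b):
    """wav_*: (batch, samples); labels_*: (batch, classes) multi-hot targets."""
    mixed_wav = wav_a + wav_b                                 # additive mixing
    mixed_labels = torch.clamp(labels_a + labels_b, max=1.0)  # label union
    return mixed_wav, mixed_labels
```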