A novel loss-balancer mechanism is introduced to stabilize training: the weight of each loss now defines the fraction of the overall gradient it should represent, decoupling the choice of this hyper-parameter from the typical scale of the loss.
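The balancing rule above can be sketched in a few lines (a minimal illustration with invented names, not the paper's implementation): each loss's gradient is rescaled so that its share of the combined gradient norm equals its normalized weight, regardless of the raw magnitude of the loss.

```python
import math

def balance_gradients(grads, weights, total_norm=1.0, eps=1e-12):
    """Combine per-loss gradients so each contributes a fraction of the
    overall gradient norm equal to its normalized weight, independent of
    the raw scale of the corresponding loss."""
    total_w = sum(weights)
    combined = [0.0] * len(grads[0])
    for g, w in zip(grads, weights):
        norm = math.sqrt(sum(x * x for x in g)) + eps  # current scale of this term
        scale = (w / total_w) * total_norm / norm      # rescale to its target share
        for i, x in enumerate(g):
            combined[i] += x * scale
    return combined
```

With equal weights, a gradient of norm 5 and one of norm 2 each end up contributing the same norm (0.5) to the combined update, which is exactly the decoupling the summary describes.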
This work introduces a high-fidelity universal neural audio compression algorithm that achieves ~90x compression of 44.1 kHz audio into tokens at just 8 kbps bandwidth by combining advances in high-fidelity audio generation with better vector quantization techniques from the image domain, along with improved adversarial and reconstruction losses.
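The vector-quantization component such codecs build on is residual VQ: each codebook stage quantizes the residual left by the previous stage, so a frame is represented by one index per stage. A toy sketch follows (invented names and tiny hand-picked codebooks; a real codec learns its codebooks end-to-end):

```python
def rvq_encode(vec, codebooks):
    """Residual VQ: each stage quantizes the residual left by the
    previous stage; the code is one codebook index per stage."""
    residual = list(vec)
    codes = []
    for cb in codebooks:
        # pick the nearest codeword by squared Euclidean distance
        idx = min(range(len(cb)),
                  key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, cb[i])))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Sum the selected codewords across stages to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out
```

Each extra stage refines the approximation, which is why bitrate can be traded smoothly against fidelity by varying the number of stages.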
The aim is to address the lack of a principled treatment of data acquired indistinctly in the temporal and frequency domains in a way that is robust to missing or noisy observations, and that at the same time models uncertainty effectively.
The UR-AIR system submission to the logical access (LA) and speech deepfake (DF) tracks of the ASVspoof 2021 Challenge is presented, and a channel-robust synthetic speech detection system for the challenge is proposed.
A deep convolutional GAN is presented which leverages techniques from MP3/Vorbis audio compression to produce long, high-quality audio samples with long-range coherence, exploiting the auditory masking and psychoacoustic perception limits of the human ear to widen the true distribution and stabilize the training process.
This paper proposes a novel approach for reconstructing higher frequencies from considerably longer sequences of low-quality MP3 audio by inpainting audio spectrograms with residually stacked autoencoder blocks, manipulating individual amplitude and phase values in relation to perceptual differences.
This work proposes overfitting variational Bayesian neural networks to the data and compressing an approximate posterior weight sample using relative entropy coding instead of quantizing and entropy coding it, which enables direct optimization of the rate-distortion performance by minimizing the $\beta$-ELBO.
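For context, and assuming standard notation not spelled out in the snippet, minimizing the negative $\beta$-ELBO makes the rate-distortion trade-off explicit: the expected negative log-likelihood is the distortion, the KL term is the rate (it bounds the expected length of the relative-entropy-coded weight sample), and $\beta$ sets the trade-off:

```latex
\min_{q}\;
\underbrace{-\,\mathbb{E}_{w\sim q}\!\left[\log p(\mathcal{D}\mid w)\right]}_{\text{distortion}}
\;+\;
\beta\,\underbrace{\mathrm{KL}\!\left(q(w)\,\|\,p(w)\right)}_{\text{rate}}
```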
A new input representation and a simple architecture are proposed to achieve improved prosody modeling in TTS, and experiments demonstrate that the reference encoder learns better speaker-independent prosody when a discrete code is used as input.
A signal decomposition method is proposed to isolate the spatial error, in terms of interchannel gain leakages and changes in relative delays, from a processed signal, which allows the computation of simple energy-ratio metrics, providing objective measures of spatial and non-spatial signal qualities, with minimal assumptions and no dataset dependency.