3260 papers • 126 benchmarks • 313 datasets
Audio generation (synthesis) is the task of generating raw audio such as speech. (Image credit: MelNet)
These leaderboards are used to track progress in audio generation.
Use these libraries to find audio generation models and implementations.
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
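The key to modeling raw waveforms autoregressively at tens of thousands of samples per second is a stack of dilated causal convolutions, whose receptive field grows exponentially with depth. A minimal sketch (plain numpy, not the paper's implementation; function names are illustrative):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution: each output depends only on current and past
    samples, spaced `dilation` steps apart (zero-padded on the left)."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([sum(w[j] * xp[i + pad - j * dilation] for j in range(k))
                     for i in range(len(x))])

def receptive_field(kernel_size, dilations):
    """Number of past samples visible to a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Doubling dilations (1, 2, 4, ..., 512) give exponentially growing context:
print(receptive_field(2, [2**i for i in range(10)]))  # -> 1024
```

With kernel size 2 and ten doubling layers, a single stack already covers 1024 samples of context, which is why depth (not kernel width) buys the long-range structure.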
Through extensive empirical investigations on the NSynth dataset, it is demonstrated that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts.
SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long sequence modeling, is proposed, identifying that S4 can be unstable during autoregressive generation, and providing a simple improvement to its parameterization by drawing connections to Hurwitz matrices.
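The stability connection mentioned above rests on a standard fact: a state matrix is Hurwitz when every eigenvalue has a strictly negative real part, so the associated linear dynamics decay instead of blowing up during generation. A small sketch of that check (illustrative only; the construction below is a generic negative-diagonal-plus-skew-symmetric parameterization, not the paper's exact one):

```python
import numpy as np

def is_hurwitz(A):
    """True when all eigenvalues of A have negative real part, i.e. the
    continuous-time system dx/dt = A x is asymptotically stable."""
    return bool(np.all(np.linalg.eigvals(A).real < 0))

# Forcing a negative diagonal plus a skew-symmetric part guarantees the
# Hurwitz property, since the skew part only shifts eigenvalues imaginarily.
rng = np.random.default_rng(0)
S = rng.standard_normal((4, 4))
A = -np.diag(np.abs(rng.standard_normal(4)) + 0.1) + (S - S.T) / 2
print(is_hurwitz(A))  # -> True
```

Parameterizing the state matrix so it is Hurwitz by construction is what prevents the recurrence from diverging when unrolled sample by sample.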
This work designs a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve, and applies it to a variety of audio generation tasks, showing improvements over previous approaches in both density estimates and human judgments.
This work uses the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis.
It is shown that the model, which combines memory-less modules (autoregressive multilayer perceptrons) with stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variation in temporal sequences over very long time spans, on three datasets of different natures.
This work introduces a new audio processing technique that increases the sampling rate of signals such as speech or music using deep convolutional neural networks, demonstrating the effectiveness of feed-forward convolutional architectures on an audio generation task.
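The classical baseline for raising a signal's sampling rate is simple interpolation; the learned approach above improves on this by predicting the missing high-frequency content rather than smoothing between known samples. A minimal interpolation baseline for comparison (function name is illustrative):

```python
import numpy as np

def upsample_linear(x, r):
    """Naive super-resolution baseline: raise the sampling rate of `x` by an
    integer factor `r` via linear interpolation between known samples."""
    n = len(x)
    t_lo = np.arange(n)                    # original sample times
    t_hi = np.arange((n - 1) * r + 1) / r  # r-times denser time grid
    return np.interp(t_hi, t_lo, x)

x = np.array([0.0, 1.0, 0.0, -1.0])   # low-rate signal
print(upsample_linear(x, 2))           # -> [ 0.   0.5  1.   0.5  0.  -0.5 -1. ]
```

Linear interpolation can only reproduce frequencies present in the low-rate input, which is exactly the gap a trained convolutional upsampler is meant to close.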
This work introduces a high-fidelity universal neural audio compression algorithm that achieves ~90x compression of 44.1 kHz audio into tokens at just 8 kbps bandwidth by combining advances in high-fidelity audio generation with better vector quantization techniques from the image domain, along with improved adversarial and reconstruction losses.
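The ~90x figure follows directly from the bitrates involved, assuming 16-bit mono PCM as the uncompressed reference (the bit depth and channel count are assumptions here, not stated above):

```python
# Rough arithmetic behind the ~90x compression ratio quoted above.
sample_rate = 44_100          # Hz
bit_depth = 16                # bits per sample (assumed: 16-bit mono PCM)
pcm_kbps = sample_rate * bit_depth / 1000   # 705.6 kbps uncompressed
codec_kbps = 8                # compressed token bandwidth
print(round(pcm_kbps / codec_kbps, 1))      # -> 88.2, i.e. roughly 90x
```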
The proposed model generates notes as magnitude spectrograms from any probabilistic latent code sample, with expressive control of orchestral timbres and playing styles, and can be applied to other sound domains, including a user's libraries with custom sound tags that could be mapped to specific generative controls.