PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit

Published in

North American Chapter of the Association for C...(2022)

TL;DR

This paper describes the design philosophy and core architecture of PaddleSpeech, which supports several essential speech-to-text and text-to-speech tasks and achieves competitive or state-of-the-art performance on various speech datasets.

Abstract

PaddleSpeech is an open-source all-in-one speech toolkit. It aims to facilitate the development and research of speech processing technologies by providing an easy-to-use command-line interface and a simple code structure. This paper describes the design philosophy and core architecture of PaddleSpeech to support several essential speech-to-text and text-to-speech tasks. PaddleSpeech achieves competitive or state-of-the-art performance on various speech datasets and implements the most popular methods. It also provides recipes and pretrained models to quickly reproduce the experimental results in this paper. PaddleSpeech is publicly available at https://github.com/PaddlePaddle/PaddleSpeech.

Authors

Renjie Zheng

3 papers

Junkun Chen

3 papers

Xintong Li

2 papers

Liang Huang

3 papers

Yuxin Huang

1 paper

Xiaojie Chen

1 paper

Xiaoguang Hu

1 paper

References (52 items)

SpeechBrain: A General-Purpose Speech Toolkit

AST: Audio Spectrogram Transformer

WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit

Automatic punctuation restoration with BERT models

NeurST: Neural Speech Translation Toolkit

Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition

StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization

AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

Fluent and Low-latency Simultaneous Speech-to-Speech Translation with Self-adaptive Training

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Fairseq S2T: Fast Speech-to-Text Modeling with Fairseq

SpeedySpeech: Efficient Neural Speech Synthesis

FastPitch: Parallel Text-to-Speech with Pitch Prediction

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Multi-Band MelGAN: Faster Waveform Generation For High-Quality Text-To-Speech

ESPnet-ST: All-in-One Speech Translation Toolkit

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

WaveFlow: A Compact Flow-based Model for Raw Audio

PyTorch: An Imperative Style, High-Performance Deep Learning Library

CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)

Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Learning Alignment for Multimodal Emotion Recognition from Speech

DELTA: A DEep learning based Language Technology plAtform

MuST-C: a Multilingual Speech Translation Corpus

ERNIE: Enhanced Representation through Knowledge Integration

Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems

Deep Segment Attentive Embedding for Duration Robust Speaker Verification

Neural Speech Synthesis with Transformer Network

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

Automatic acoustic detection of birds through deep learning: The first Bird Audio Detection challenge

Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition

ESPnet: End-to-End Speech Processing Toolkit

Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline

Attention is All you Need

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

Audio Set: An ontology and human-labeled dataset for audio events

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

ESC: Dataset for Environmental Sound Classification

Librispeech: An ASR corpus based on public domain audio books

CROWDMOS: An approach for crowdsourcing mean opinion score studies

Pocketsphinx: A Free, Real-Time Continuous Speech Recognition System for Hand-Held Devices

Julius - an open source real-time large vocabulary recognition engine

The HTK book

The LJ Speech Dataset

Overview of the IWSLT 2012 evaluation campaign

The Kaldi Speech Recognition Toolkit

RASR - The RWTH Aachen University Open Source Speech Recognition Toolkit

TensorFlow: A System for Large-Scale Machine Learning (Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI '16)

Alexandre de Brébisson

Field of Study

Computer Science, Engineering

Journal Information

Name

ArXiv

Volume

abs/2005.00687

Venue Information

Name

North American Chapter of the Association for Computational Linguistics

Type

conference

URL

https://www.aclweb.org/portal/naacl

Alternate Names

North Am Chapter Assoc Comput Linguistics
NAACL
