ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications (2022-11-08)

TL;DR

This paper introduces the ATCO2 corpus, a dataset intended to foster research on air traffic control (ATC) communications, a field that has lagged behind for lack of annotated data. The authors expect it to support work on robust ASR and NLU both within ATC and in the general research community.

Abstract

Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots over very high frequency (VHF) radio channels. Incorporating these novel technologies into ATC, a low-resource domain, requires large-scale annotated datasets for developing data-driven AI systems such as automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging ATC field, which has lagged behind due to lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotations of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, a signal-to-noise ratio estimate and an English language detection score per sample. Both are available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3) The ATCO2-test-set-1h corpus is a one-hour subset of the original test set corpus, which we offer for free at https://www.atco2.org/data. We expect the ATCO2 corpus will foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.
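
The per-sample signal-to-noise ratio estimate and English language detection score mentioned in the abstract lend themselves to selecting a cleaner subset of the pseudo-labeled data. The following is only a minimal sketch of that idea: the record type, its field names and both thresholds are hypothetical and are not taken from the corpus or its tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PseudoLabeledSample:
    """Hypothetical record for one pseudo-labeled sample (field names are assumptions)."""
    audio_id: str
    transcript: str       # automatic transcript from the in-domain recognizer
    snr_db: float         # signal-to-noise ratio estimate, assumed to be in dB
    english_score: float  # English language detection score, assumed to lie in [0, 1]


def select_samples(
    samples: List[PseudoLabeledSample],
    min_snr_db: float = 10.0,        # illustrative threshold, not a recommended value
    min_english_score: float = 0.5,  # illustrative threshold, not a recommended value
) -> List[PseudoLabeledSample]:
    """Keep samples whose SNR and English score both reach the given thresholds."""
    return [
        s for s in samples
        if s.snr_db >= min_snr_db and s.english_score >= min_english_score
    ]


if __name__ == "__main__":
    demo = [
        PseudoLabeledSample("a", "descend flight level eight zero", 18.0, 0.9),
        PseudoLabeledSample("b", "noisy segment", 3.0, 0.8),
        PseudoLabeledSample("c", "non english segment", 20.0, 0.1),
    ]
    for sample in select_samples(demo):
        print(sample.audio_id, sample.transcript)
```

Suitable thresholds would depend on the downstream task and should be tuned on held-out data rather than copied from this sketch.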

</tags>: extra metadata

</text>: ground-truth transcripts with high-level entity annotations (callsigns, commands and values)

</speaker>: speaker information identifying whether the segment comes from an ATCO or a pilot. Unknown cases are tagged with <UNK>

</segment>: one sample of data. One recording may have one or more segments

</end>: timing information with speech activity by the speakers
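
The fields listed above suggest an XML-like annotation file with one element per segment. The sketch below shows how such a file might be read with Python's standard library. The element names `segment`, `speaker`, `text`, `tags` and `end` follow the list above; the `<data>` root, the `<start>` element, the nesting and the sample values are assumptions made for illustration and may not match the released files.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation content. The layout is an assumption, not the official schema.
# The unknown-speaker tag <UNK> is escaped so that the XML stays well-formed.
SAMPLE = """
<data>
  <segment>
    <start>0.00</start>
    <end>3.25</end>
    <speaker>ATCO</speaker>
    <text>lufthansa four five six descend flight level eight zero</text>
    <tags></tags>
  </segment>
  <segment>
    <start>3.80</start>
    <end>6.10</end>
    <speaker>&lt;UNK&gt;</speaker>
    <text>descending flight level eight zero lufthansa four five six</text>
    <tags></tags>
  </segment>
</data>
"""


def read_segments(xml_string):
    """Return one dict per <segment> element found in the annotation string."""
    root = ET.fromstring(xml_string.strip())
    segments = []
    for seg in root.iter("segment"):
        segments.append({
            # Missing timing elements fall back to NaN instead of raising.
            "start": float(seg.findtext("start", default="nan")),
            "end": float(seg.findtext("end", default="nan")),
            "speaker": seg.findtext("speaker", default="<UNK>").strip(),
            "text": seg.findtext("text", default="").strip(),
            "tags": seg.findtext("tags", default="").strip(),
        })
    return segments


if __name__ == "__main__":
    for segment in read_segments(SAMPLE):
        print(segment["speaker"], segment["start"], segment["end"], segment["text"])
```

Because one recording may hold several segments, the function returns a list; code that needs speaker turns can group consecutive entries by the `speaker` value.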