NusaCrowd: Open Source Initiative for Indonesian NLP Resources (2022-12-19)

TL;DR

NusaCrowd collects and unifies existing resources for Indonesian languages, including previously non-public ones. The resulting collection enables the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia.

Abstract

We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed both manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
