The unreasonabl… (2020-07-30T00:00:00.000000Z)

Abstract

Motivated by the seemingly high accuracy levels of machine learning (ML) models in Moldavian versus Romanian dialect identification and the increasing research interest in this topic, we provide a follow‐up on the Moldavian versus Romanian Cross‐Dialect Topic Identification (MRC) shared task of the VarDial 2019 evaluation campaign. The shared task included two subtask types: one that consisted in discriminating between the Moldavian and Romanian dialects and one that consisted in classifying documents by topic across the two dialects of Romanian. Participants achieved impressive scores; for example, the top model for Moldavian versus Romanian dialect identification obtained a macro‐F1 score of 0.895. We conduct a subjective evaluation by human annotators, showing that humans attain much lower accuracy rates than ML models. Hence, it remains unclear why the methods proposed by participants attain such high accuracy rates. Our goal is to understand (i) why the proposed methods work so well (by visualizing the discriminative features) and (ii) to what extent these methods can keep their high accuracy levels, for example, when we shorten the text samples to single sentences or when we use tweets at inference time. A secondary goal of our work is to propose an improved ML model using ensemble learning. Our experiments show that ML models can accurately identify the dialects, even at the sentence level and across different domains (news articles vs. tweets). We also analyze the most discriminative features of the best performing models, providing some explanations behind the decisions taken by these models. Interestingly, we learn new dialectal patterns previously unknown to us or to our human annotators. Furthermore, we conduct experiments showing that the ML performance on the MRC shared task can be improved through an ensemble based on stacking.
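For readers unfamiliar with the macro-F1 metric quoted in the abstract, the following is a minimal sketch of how such a score can be computed: the F1 score is calculated separately for each class and the results are averaged with equal weight. The function name `macro_f1`, the label strings "MD" and "RO", and the toy predictions are assumptions made for illustration only; they are not taken from the shared task data or from the authors' evaluation code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of the per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels invented for illustration (hypothetical, not MOROCO data).
y_true = ["MD", "MD", "MD", "RO", "RO", "RO"]
y_pred = ["MD", "MD", "RO", "RO", "RO", "RO"]
print(round(macro_f1(y_true, y_pred), 3))  # prints 0.829
```

In this toy case the per-class F1 scores are 0.8 for "MD" and 6/7 for "RO". Because the two classes are weighted equally, a model cannot reach a high macro-F1 by favoring one dialect over the other.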

How to cite this article: Găman M, Ionescu RT. The unreasonable effectiveness of machine learning in Moldavian versus Romanian dialect identification
