BigNLI: Native Language Identification with Big Bird Embeddings

Published in

International Conference on Language Resources ...(2023)

TL;DR

This work shows that input size is a limiting factor for transformer-based Native Language Identification (NLI) and that classifiers trained using Big Bird embeddings outperform linguistic feature engineering models by a large margin on the Reddit-L2 dataset; it also offers further insight into input length dependencies.

Abstract

Native Language Identification (NLI) aims to identify an author’s native language based on their writing in another language. Historically, the task has relied heavily on time-consuming linguistic feature engineering, and NLI transformer models have thus far failed to offer effective, practical alternatives. The current work shows that input size is a limiting factor, and that classifiers trained using Big Bird embeddings outperform linguistic feature engineering models (for which we reproduce previous work) by a large margin on the Reddit-L2 dataset. Additionally, we provide further insight into input length dependencies, show consistent out-of-sample (Europe subreddit) and out-of-domain (TOEFL-11) performance, and qualitatively analyze the embedding space. Given the effectiveness and computational efficiency of this method, we believe it offers a promising avenue for future NLI work.
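
To make the input-size point concrete, the following is a minimal sketch of how much of a long text survives truncation under two input windows. It is not the authors' code. It assumes whitespace-separated words as a stand-in for subword tokens, a hypothetical 2,048-token document, and the commonly cited limits of 512 tokens for BERT-style encoders and 4,096 tokens for Big Bird. The helper name `retained_fraction` is invented for illustration.

```python
# Illustrative sketch only: whitespace tokens stand in for subword tokens,
# and the two limits are commonly cited maximum input lengths (assumptions here).
BERT_MAX_TOKENS = 512       # typical limit for BERT-style encoders
BIG_BIRD_MAX_TOKENS = 4096  # limit commonly cited for Big Bird


def retained_fraction(text: str, max_tokens: int) -> float:
    """Return the share of a document's tokens that fit in the input window."""
    tokens = text.split()
    if not tokens:
        return 1.0  # nothing to truncate
    return min(len(tokens), max_tokens) / len(tokens)


# Hypothetical author text of 2,048 whitespace tokens.
document = " ".join(["word"] * 2048)

print(retained_fraction(document, BERT_MAX_TOKENS))      # 0.25
print(retained_fraction(document, BIG_BIRD_MAX_TOKENS))  # 1.0
```

Under these assumptions, the shorter window sees only a quarter of the text, while the longer window sees all of it. Real subword tokenizers would yield different counts.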

Authors

Sergey Kramp

Giovanni Cassani

Chris Emmery

Field of Study

Computer Science

Journal Information

Name

ArXiv

Volume

abs/2005.00687

Venue Information

Name

International Conference on Language Resources and Evaluation

Type

conference

URL

http://www.lrec-conf.org/

Alternate Names

LREC
Int Conf Lang Resour Evaluation