3260 papers • 126 benchmarks • 313 datasets
Language identification is the task of determining the language of a text.
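For a concrete sense of the task, here is a minimal sketch using the off-the-shelf langdetect Python library; any general-purpose LID tool would serve equally well, and the input strings are arbitrary examples.

```python
# Minimal text language identification with langdetect (pip install langdetect).
from langdetect import DetectorFactory, detect, detect_langs

DetectorFactory.seed = 0  # langdetect is stochastic; pin the seed for repeatable output

print(detect("Bonjour tout le monde"))      # -> 'fr'
print(detect_langs("Hello, how are you?"))  # -> candidate languages ranked by probability
```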
These leaderboards are used to track progress in Language Identification
Use these libraries to find Language Identification models and implementations
This paper describes WiLI-2018, a benchmark dataset for monolingual written natural language identification: a publicly available, free-of-charge dataset of short text extracts from Wikipedia containing 1,000 paragraphs for each of 235 languages.
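A common use of WiLI-2018 is as a testbed for simple character n-gram baselines. The sketch below trains one with scikit-learn; the file names mirror the parallel text/label layout of the public release but are assumptions to adjust for your local copy.

```python
# Sketch: character n-gram baseline for written LID on WiLI-2018.
# x_*.txt holds one paragraph per line, y_*.txt the matching language
# label per line (assumed layout of the released archive).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

x_train, y_train = read_lines("x_train.txt"), read_lines("y_train.txt")

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3), max_features=100_000),
    LogisticRegression(max_iter=1000),
)
clf.fit(x_train, y_train)

print(clf.predict(["Ein kurzer Absatz auf Deutsch."]))  # expect a German label
```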
This paper describes the core architecture of SpeechBrain, designed to support several speech processing tasks of common interest and to let users naturally conceive, compare, and share novel speech processing pipelines.
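Spoken language identification is among the tasks SpeechBrain supports. Below is a minimal inference sketch: the model identifier refers to SpeechBrain's published VoxLingua107 ECAPA checkpoint, sample.wav is a placeholder path, and recent SpeechBrain releases expose the same class under speechbrain.inference.

```python
# Sketch: spoken LID with a pretrained SpeechBrain classifier.
from speechbrain.pretrained import EncoderClassifier

lid = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa",  # published checkpoint
    savedir="pretrained_lid",
)
out_prob, score, index, text_lab = lid.classify_file("sample.wav")  # placeholder audio
print(text_lab)  # predicted language label
```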
This paper publishes GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability, and efficiency; it identifies 1,665 languages, a large increase in coverage compared to prior work.
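Since GlotLID-M is distributed as a fastText model, inference reduces to loading the binary and calling predict. In this sketch, the Hugging Face repo id and filename are assumptions based on the model's public release; consult the model card if they have changed.

```python
# Sketch: identifying a language with GlotLID-M via fastText.
import fasttext
from huggingface_hub import hf_hub_download

# Repo id and filename are assumptions; see the GlotLID model card.
model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
model = fasttext.load_model(model_path)

labels, probs = model.predict("Hallo, wie geht es dir?", k=3)
print(list(zip(labels, probs)))  # top-3 language labels with confidence scores
```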
This paper presents a treebank of Hindi-English code-switching tweets under the Universal Dependencies scheme and proposes a neural stacking model for parsing that efficiently leverages the part-of-speech tag and syntactic tree annotations in the code-switching treebank and the preexisting Hindi and English treebanks.
The Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, is compiled and made publicly available.
The results and main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval), based on a new dataset containing over 14,000 English tweets, are presented.
This work introduces an encoder that captures word-level representations of speech for cross-task transfer learning and shows that the representations learned during pre-training transfer across distinct speech processing tasks and datasets.
This work presents speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit and finds an average Character Error Rate improvement for twelve target languages; for most of these languages, these are the first published results on end-to-end Automatic Speech Recognition.
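Character Error Rate, the metric reported above, is the character-level edit distance between hypothesis and reference divided by the reference length; a self-contained sketch:

```python
# Sketch: Character Error Rate as normalized Levenshtein distance over characters.
def cer(reference: str, hypothesis: str) -> float:
    r, h = reference, hypothesis
    # Dynamic-programming edit-distance table: d[i][j] is the distance
    # between the first i characters of r and the first j characters of h.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

print(cer("hello world", "helo wrld"))  # 2 edits / 11 chars ≈ 0.182
```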
This paper generates semi-random search phrases from language-specific Wikipedia data, uses them to retrieve videos from YouTube for 107 languages, and builds language recognition models for several spoken language identification tasks on the resulting data.