Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning (2015-08-31)

TL;DR

We propose an automated author disambiguation method that uses more than one million crowdsourced annotations to learn an accurate classifier for identifying coreferring authors and to guide the semi-supervised clustering of scientific publications by distinct authors.

Abstract

Building on more than one million crowdsourced annotations that we publicly release, we propose a new automated disambiguation solution exploiting this data (i) to learn an accurate classifier for identifying coreferring authors and (ii) to guide the clustering of scientific publications by distinct authors in a semi-supervised way. To the best of our knowledge, our analysis is the first to be carried out on data of this size and coverage. With respect to the state of the art, we validate the general pipeline used in most existing solutions, and improve by: (i) proposing new phonetic-based blocking strategies, thereby increasing recall; (ii) adding strong ethnicity-sensitive features for learning a linkage function, thereby tailoring disambiguation to non-Western author names whenever necessary; and (iii) showing the importance of balancing negative and positive examples when learning the linkage function.
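The pipeline outlined in the abstract (blocking, then learning a pairwise linkage function from balanced examples) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes classic Soundex as the phonetic key (the abstract only says "phonetic-based"), a hypothetical signature record with `author` and `author_id` fields, and simple downsampling of the majority class as the balancing strategy. The names `soundex`, `block_signatures`, and `balanced_pairs` are introduced here for illustration only.

```python
import itertools
import random
from collections import defaultdict

# Classic Soundex digit groups (assumed phonetic key; the paper may use others).
_CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def soundex(name):
    """Return the 4-character Soundex code of a name (ASCII letters only)."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    first = letters[0]
    out = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is emitted again
    return (first.upper() + "".join(out) + "000")[:4]


def block_signatures(signatures):
    """Group signatures whose surnames share a phonetic key."""
    blocks = defaultdict(list)
    for sig in signatures:
        surname = sig["author"].split(",")[0]
        blocks[soundex(surname)].append(sig)
    return dict(blocks)


def balanced_pairs(block, seed=0):
    """Sample equally many coreferring and non-coreferring pairs from a block.

    Assumption: balancing is done by downsampling the larger class.
    """
    pos, neg = [], []
    for a, b in itertools.combinations(block, 2):
        (pos if a["author_id"] == b["author_id"] else neg).append((a, b))
    n = min(len(pos), len(neg))
    rng = random.Random(seed)
    return rng.sample(pos, n), rng.sample(neg, n)


# Hypothetical annotated signatures (author_id plays the role of a crowdsourced label).
signatures = [
    {"author": "Mueller, J.", "author_id": 1},
    {"author": "Müller, Jan", "author_id": 1},
    {"author": "Miller, J.", "author_id": 2},
    {"author": "Smith, A.", "author_id": 3},
]
blocks = block_signatures(signatures)
positives, negatives = balanced_pairs(blocks[soundex("Mueller")])
```

The balanced pairs would then serve as training examples for a pairwise classifier, whose predicted probabilities could drive a clustering of the signatures within each block. Both of those steps are omitted here.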

Authors

Gilles Louppe

5 papers

Hussein T. Al-Natsheh

3 papers

Mateusz Susik

1 paper
