A comprehensive study of how the quality of embeddings changes according to hyper-parameter settings is presented, and it is observed that bigger corpora do not necessarily produce better biomedical domain word embeddings.
The quality of word embeddings depends on the input corpora, model architectures, and hyper-parameter settings. Using the state-of-the-art neural embedding tool word2vec and both intrinsic and extrinsic evaluations, we present a comprehensive study of how the quality of embeddings changes according to these features. Apart from identifying the most influential hyper-parameters, we also observe one that creates contradictory results between intrinsic and extrinsic evaluations. Furthermore, we find that bigger corpora do not necessarily produce better biomedical domain word embeddings. We make our evaluation tools and resources as well as the created state-of-the-art word embeddings available under open licenses from https://github.com/cambridgeltl/BioNLP-2016.
Gamal K. O. Crichton