MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval (2023-07-02)

TL;DR

Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines, including much larger models such as the GPT-3-sized cpt-text-XL.

Abstract

Motivation: Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine.

Results: To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models such as GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks.

Availability: The MedCPT code and API are available at https://github.com/ncbi/MedCPT.
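The abstract states that the retriever is trained with contrastive learning on query-article click pairs but does not give the objective. As a rough illustration only, the sketch below shows one common form of contrastive training for dense retrievers: an in-batch-negative (InfoNCE-style) loss, where each query's clicked article is the positive and the other articles in the batch act as negatives. The function names, the dot-product scoring, the temperature parameter, and the toy vectors are all assumptions made for this example; they are not taken from the MedCPT code.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, article_vecs, temperature=1.0):
    """Mean in-batch-negative contrastive loss (hypothetical helper).

    article_vecs[i] is assumed to be the clicked article for query_vecs[i];
    every other article in the batch is treated as a negative.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [dot(q, a) / temperature for a in article_vecs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the clicked article under a softmax.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def rank_articles(query_vec, article_vecs):
    """Return article indices sorted by descending similarity to the query."""
    scores = [dot(query_vec, a) for a in article_vecs]
    return sorted(range(len(article_vecs)), key=lambda j: scores[j], reverse=True)


# Toy 2-dimensional embeddings (made up for illustration).
queries = [[1.0, 0.0], [0.0, 1.0]]
articles = [[2.0, 0.0], [0.0, 2.0]]

aligned_loss = in_batch_contrastive_loss(queries, articles)
swapped_loss = in_batch_contrastive_loss(queries, articles[::-1])
print(aligned_loss < swapped_loss)
print(rank_articles(queries[0], articles))
```

Minimizing a loss of this kind pulls each query embedding toward its clicked article and away from the other articles in the batch, which is what lets a nearest-neighbor search over article embeddings serve as a first-stage retriever at inference time.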

Authors

Zhiyong Lu


Qiao Jin


Qingyu Chen


