Sparsifying Transformer Models with Trainable Representation Pooling (2020-09-10)

TL;DR

A novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations during the training process, thus focusing on the task-specific parts of an input.

Abstract

We propose a novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations during the training process, thus focusing on the task-specific parts of an input. A robust trainable top-k operator reduces the quadratic time and memory complexity to sublinear. Our experiments on a challenging long document summarization task show that even our simple baseline performs comparably to the current SOTA, and that with trainable pooling we can retain its top quality while being 1.8× faster during training, 4.5× faster during inference, and up to 13× more computationally efficient in the decoder.
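To make the idea of selecting token representations concrete, here is a minimal sketch of hard top-k pooling in plain Python. It is not the trainable operator from the paper: it uses a non-differentiable sort, and the linear scorer `w` and the names `score_tokens` and `top_k_pool` are assumptions introduced only for illustration.

```python
from typing import List, Sequence, Tuple


def score_tokens(reps: Sequence[Sequence[float]], w: Sequence[float]) -> List[float]:
    """Score each token representation with a linear scorer (hypothetical)."""
    return [sum(x * wi for x, wi in zip(rep, w)) for rep in reps]


def top_k_pool(
    reps: Sequence[Sequence[float]], w: Sequence[float], k: int
) -> Tuple[List[int], List[List[float]]]:
    """Keep the k highest-scoring representations, in their original order.

    This is a hard (non-differentiable) selection, used here only to
    illustrate pooling; a trainable operator would need a relaxation.
    """
    scores = score_tokens(reps, w)
    # Rank indices by descending score; ties go to the earlier token.
    ranked = sorted(range(len(reps)), key=lambda i: (-scores[i], i))[:k]
    kept = sorted(ranked)  # restore the original token order
    return kept, [list(reps[i]) for i in kept]


# Toy input: five 2-dimensional token representations.
reps = [[0.1, 0.0], [0.9, 0.2], [0.0, 0.1], [0.7, 0.8], [0.2, 0.1]]
w = [1.0, 1.0]  # assumed scorer weights
kept, pooled = top_k_pool(reps, w, k=2)
print(kept)    # indices of the selected tokens
print(pooled)  # their representations
```

Later layers then attend over the k pooled representations only, rather than over all input tokens, which is where the savings in attention cost would come from.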

Authors

Łukasz Borchmann

Michal Pietruszka

Lukasz Garncarek

References (45 items)

Doc2Dict: Information Extraction as Text Generation

Poolingformer: Long Document Modeling with Pooling Attention

Multiscale Vision Transformers

Hierarchical Learning for Generation with Long Source Sequences

Successive Halving Top-k Operator

Big Bird: Transformers for Longer Sequences

Do Transformers Need Deep Long-Range Memory?

Linformer: Self-Attention with Linear Complexity

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

Longformer: The Long-Document Transformer

A Divide-and-Conquer Approach to the Summarization of Long Documents

Efficient Content-Based Sparse Attention with Routing Transformers

Sparse Sinkhorn Attention

Differentiable Top-k Operator with Optimal Transport

PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination

Reformer: The Efficient Transformer

PyTorch: An Imperative Style, High-Performance Deep Learning Library

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Generating Long Sequences with Sparse Transformers

fairseq: A Fast, Extensible Toolkit for Sequence Modeling

Reparameterizable Subset Sampling via Continuous Relaxations

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

Efficient Attention: Attention with Linear Complexities

Neural Nearest Neighbors Networks

Bottom-Up Abstractive Summarization

Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting

A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss

A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

The NarrativeQA Reading Comprehension Challenge

Temporal dynamics of eye‐tracking and EEG during reading and relevance decisions

A Continuous Relaxation of Beam Search for End-to-end Training of Neural Sequence Models

Attention is All you Need

Differentiable Scheduled Sampling for Credit Assignment

Coarse-to-Fine Question Answering for Long Documents

Making do with what we have: use your bootstraps.

Understanding the strategies of document literacy and their conditions of use.

On Extractive and Abstractive Neural Document Summarization with Transformer Language Models

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Improving Language Understanding by Generative Pre-Training

Types of Document Knowledge: From Structures to Strategies (Document Strategies).

Table columns: #, ROUGE-1 (CI), ROUGE-2 (CI)

2020) parametrized the top-k operator in terms of an optimal transport problem. Employing such an algorithm instead of softmax may induce numerous zero weights in the attention matrix

We assumed a fixed decoding length of 256 or 512 tokens to discount the lower processing time of models that predict the end-of-sequence token earlier.

pooling). PoWER-BERT. For the PoWER-based models, we fine-tune vanilla Transformers with progressive elimination of word vectors on the encoder side, following the approach

Without local attention, their results were several points lower. We assumed an LSH bucket size of 64 and four parallel hashes. Bucket size follows the authors
