AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting (2020-08-03)

TL;DR

This work proposes a novel text spotter, the Ambiguity Eliminating Text Spotter (AE TextSpotter), which learns both visual and linguistic features to significantly reduce ambiguity in text detection. To the authors' knowledge, it is the first work to use a language model to improve text detection.

Abstract

Scene text spotting aims to detect and recognize entire words or sentences, each made up of multiple characters, in natural images. It remains challenging because ambiguity often arises when the spacing between characters is large or when the characters are spread evenly over multiple rows and columns, which produces many visually plausible groupings of the characters (e.g. "BERLIN" is incorrectly detected as "BERL" and "IN" in Fig. 1(c)). Unlike previous works that rely only on visual features for text detection, this work proposes a novel text spotter, named Ambiguity Eliminating Text Spotter (AE TextSpotter), which learns both visual and linguistic features to significantly reduce ambiguity in text detection. The proposed AE TextSpotter has three important benefits. 1) The linguistic representation is learned together with the visual representation in a single framework. To our knowledge, this is the first work to improve text detection by using a language model. 2) A carefully designed language module reduces the detection confidence of incorrect text lines, so that they are easily pruned in the detection stage. 3) Extensive experiments show that AE TextSpotter outperforms other state-of-the-art methods by a large margin. For example, on a carefully selected validation set of extremely ambiguous samples from the IC19-ReCTS dataset, our approach surpasses other methods by more than 4%. The code has been released at this https URL. The image list and evaluation scripts of the validation set have been released at this https URL.
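
The second benefit, lowering the confidence of linguistically implausible text lines so that they can be pruned, can be illustrated with a minimal sketch. It is not the authors' implementation: the `TextLine` class, the weighted-average `rescore` function, the `weight` and `threshold` parameters, and all scores are hypothetical and chosen only to mirror the "BERLIN" versus "BERL" / "IN" example from the abstract.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    text: str
    visual_score: float      # detector confidence in [0, 1]
    linguistic_score: float  # language-model plausibility in [0, 1] (made-up values)


def rescore(line: TextLine, weight: float = 0.5) -> float:
    """Blend visual and linguistic scores (assumed weighted average, not the paper's formula)."""
    return weight * line.visual_score + (1.0 - weight) * line.linguistic_score


def prune(lines, threshold: float = 0.5, weight: float = 0.5):
    """Keep only the candidate text lines whose blended score reaches the threshold."""
    return [line for line in lines if rescore(line, weight) >= threshold]


# Competing groupings of the same characters; all scores are illustrative only.
candidates = [
    TextLine("BERLIN", visual_score=0.80, linguistic_score=0.90),
    TextLine("BERL", visual_score=0.85, linguistic_score=0.10),
    TextLine("IN", visual_score=0.75, linguistic_score=0.20),
]

kept = prune(candidates)
print([line.text for line in kept])
```

With visual scores alone, all three candidates would pass a 0.5 threshold. Once the assumed linguistic score is blended in, the fragment candidates fall below the threshold and only the full word survives.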
