Scene Text Recognition with Permuted Autoregressive Sequence Models

Cited by: 127
Authors
Bautista, Darwin [1]
Atienza, Rowel [1]
Affiliations
[1] University of the Philippines, Electrical and Electronics Engineering Institute, Quezon City, Philippines
Source
COMPUTER VISION - ECCV 2022, PT XXVIII | 2022 / Vol. 13688
Keywords
Scene text recognition; Permutation language modeling; Autoregressive modeling; Cross-modal attention; Transformer
DOI
10.1007/978-3-031-19815-1_11
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Context-aware STR methods typically use internal autoregressive (AR) language models (LM). Inherent limitations of AR models motivated two-stage methods which employ an external LM. The conditional independence of the external LM on the input image may cause it to erroneously rectify correct predictions, leading to significant inefficiencies. Our method, PARSeq, learns an ensemble of internal AR LMs with shared weights using Permutation Language Modeling. It unifies context-free non-AR and context-aware AR inference, and iterative refinement using bidirectional context. Using synthetic training data, PARSeq achieves state-of-the-art (SOTA) results in STR benchmarks (91.9% accuracy) and more challenging datasets. It establishes new SOTA results (96.0% accuracy) when trained on real data. PARSeq is optimal on accuracy vs parameter count, FLOPS, and latency because of its simple, unified structure and parallel token processing. Due to its extensive use of attention, it is robust on arbitrarily-oriented text, which is common in real-world images. Code, pretrained weights, and data are available at: https://github.com/baudm/parseq.
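The abstract's idea of an "ensemble of internal AR LMs with shared weights" can be illustrated with attention masks: each permutation of the character positions defines a different factorization order, and one decoder is trained under several such masks. The following is only a minimal sketch of that masking idea, not the authors' implementation. The function name `permutation_mask` and the parameter `order` are hypothetical, and the sketch deliberately ignores the begin/end-of-sequence tokens, the image features, and how the paper selects which permutations to train on.

```python
def permutation_mask(order):
    """Build a context-visibility mask for one factorization order.

    `order` (hypothetical name) lists the positions 0..T-1 in the order
    in which they are predicted. The result satisfies
    visible[i][j] == True when position j is predicted before position i,
    so j may serve as context for predicting i.
    """
    length = len(order)
    # rank[pos] = step at which position `pos` is predicted
    rank = [0] * length
    for step, pos in enumerate(order):
        rank[pos] = step
    return [[rank[j] < rank[i] for j in range(length)] for i in range(length)]


# Standard left-to-right AR decoding: each position sees only earlier ones.
left_to_right = permutation_mask([0, 1, 2])

# Right-to-left decoding: each position sees only later ones.
right_to_left = permutation_mask([2, 1, 0])

# An arbitrary order: predict position 1 first, then 2, then 0.
shuffled = permutation_mask([1, 2, 0])

for name, mask in [("left_to_right", left_to_right),
                   ("right_to_left", right_to_left),
                   ("shuffled", shuffled)]:
    print(name)
    for row in mask:
        print("  ", ["x" if seen else "." for seen in row])
```

Under these assumptions, training one set of weights with several such masks is what lets a single model behave like multiple AR language models. An all-False mask would correspond to context-free (non-AR) prediction, since no position would see any other.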
Pages: 178-196
Page count: 19