Scene Text Recognition with Permuted Autoregressive Sequence Models

Cited by: 78
Authors
Bautista, Darwin [1 ]
Atienza, Rowel [1 ]
Affiliations
[1] Univ Philippines, Elect & Elect Engn Inst, Quezon City, Philippines
Source
COMPUTER VISION - ECCV 2022, PT XXVIII | 2022 / Vol. 13688
Keywords
Scene text recognition; Permutation language modeling; Autoregressive modeling; Cross-modal attention; Transformer;
DOI
10.1007/978-3-031-19815-1_11
CLC Number
TP18 [Artificial intelligence theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Context-aware STR methods typically use internal autoregressive (AR) language models (LMs). Inherent limitations of AR models motivated two-stage methods that employ an external LM. Because the external LM is conditionally independent of the input image, it may erroneously rectify correct predictions, leading to significant inefficiencies. Our method, PARSeq, learns an ensemble of internal AR LMs with shared weights using Permutation Language Modeling. It unifies context-free non-AR and context-aware AR inference, as well as iterative refinement using bidirectional context. Trained on synthetic data, PARSeq achieves state-of-the-art (SOTA) results on STR benchmarks (91.9% accuracy) and on more challenging datasets, and it establishes new SOTA results (96.0% accuracy) when trained on real data. PARSeq is optimal on the accuracy vs. parameter count, FLOPS, and latency trade-offs because of its simple, unified structure and parallel token processing. Due to its extensive use of attention, it is robust to arbitrarily oriented text, which is common in real-world images. Code, pretrained weights, and data are available at: https://github.com/baudm/parseq.
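The abstract's core idea, an ensemble of shared-weight AR LMs obtained via Permutation Language Modeling, amounts to training one decoder under many decoding orders, with each order realized as an attention mask. The following is a minimal sketch of that mask construction, not the authors' implementation; `perm_attention_mask` is a hypothetical helper written for illustration:

```python
import numpy as np

def perm_attention_mask(perm):
    """Build an attention mask for an arbitrary decoding order.

    perm[k] is the token position decoded at step k. The token at
    perm[k] may attend only to tokens decoded at earlier steps,
    i.e. perm[0..k-1] (no attention to its own position, as in the
    query stream of permutation language modeling).

    Returns a boolean matrix M where M[i, j] = True iff the token at
    position i may attend to the token at position j.
    """
    T = len(perm)
    mask = np.zeros((T, T), dtype=bool)
    for k in range(T):
        for j in range(k):
            mask[perm[k], perm[j]] = True
    return mask

# Left-to-right order yields a strictly lower-triangular mask,
# i.e. ordinary AR decoding:
lr = perm_attention_mask([0, 1, 2, 3])
# Right-to-left order yields the mirrored (strictly upper-triangular)
# mask of a reversed AR LM; sampling random permutations during
# training is what lets one decoder act as an ensemble of AR LMs.
rl = perm_attention_mask([3, 2, 1, 0])
```

Training with many sampled permutations (rather than all T! of them) is what makes the single decoder usable for left-to-right AR decoding, non-AR parallel decoding, and bidirectional iterative refinement at inference time.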
Pages: 178-196 (19 pages)
Related Papers
(50 records)
  • [41] Transformer-based end-to-end scene text recognition
    Zhu, Xinghao
    Zhang, Zhi
    PROCEEDINGS OF THE 2021 IEEE 16TH CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS (ICIEA 2021), 2021, : 1691 - 1695
  • [43] STR Transformer: A Cross-domain Transformer for Scene Text Recognition
    Wu, Xing
    Tang, Bin
    Zhao, Ming
    Wang, Jianjia
    Guo, Yike
    APPLIED INTELLIGENCE, 2023, 53 (03) : 3444 - 3458
  • [44] Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition
    Xue, Chuhui
    Huang, Jiaxing
    Zhang, Wenqing
    Lu, Shijian
    Wang, Changhu
    Bai, Song
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (11) : 12908 - 12921
  • [45] An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition
    Shi, Baoguang
    Bai, Xiang
    Yao, Cong
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2017, 39 (11) : 2298 - 2304
  • [46] CATNet: Scene Text Recognition Guided by Concatenating Augmented Text Features
    Zhang, Ziyin
    Pan, Lemeng
    Du, Lin
    Li, Qingrui
    Lu, Ning
    DOCUMENT ANALYSIS AND RECOGNITION - ICDAR 2021, PT I, 2021, 12821 : 350 - 365
  • [47] Hybrid Autoregressive and Non-Autoregressive Transformer Models for Speech Recognition
    Tian, Zhengkun
    Yi, Jiangyan
    Tao, Jianhua
    Zhang, Shuai
    Wen, Zhengqi
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 762 - 766
  • [48] Background-Insensitive Scene Text Recognition with Text Semantic Segmentation
    Zhao, Liang
    Wu, Zhenyao
    Wu, Xinyi
    Wilsbacher, Greg
    Wang, Song
    COMPUTER VISION, ECCV 2022, PT XXV, 2022, 13685 : 163 - 182
  • [49] Reading scene text with fully convolutional sequence modeling
    Gao, Yunze
    Chen, Yingying
    Wang, Jinqiao
    Tang, Ming
    Lu, Hanqing
    NEUROCOMPUTING, 2019, 339 : 161 - 170
  • [50] Multi-granularity Prediction for Scene Text Recognition
    Wang, Peng
    Da, Cheng
    Yao, Cong
    COMPUTER VISION - ECCV 2022, PT XXVIII, 2022, 13688 : 339 - 355