Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition

Cited by: 260
Authors
Fang, Shancheng [1]
Xie, Hongtao [1]
Wang, Yuxin [1]
Mao, Zhendong [1]
Zhang, Yongdong [1]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
DOI
10.1109/CVPR46437.2021.00702
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Linguistic knowledge is of great benefit to scene text recognition. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from: 1) implicit language modeling; 2) unidirectional feature representation; and 3) language models with noisy input. Correspondingly, we propose an autonomous, bidirectional and iterative network, ABINet, for scene text recognition. Firstly, the autonomous principle blocks gradient flow between the vision and language models to enforce explicit language modeling. Secondly, a novel bidirectional cloze network (BCN) is proposed as the language model, based on bidirectional feature representation. Thirdly, we propose iterative correction as the execution manner of the language model, which effectively alleviates the impact of noisy input. Additionally, based on the ensemble of iterative predictions, we propose a self-training method that learns effectively from unlabeled images. Extensive experiments indicate that ABINet is superior on low-quality images and achieves state-of-the-art results on several mainstream benchmarks. Moreover, ABINet trained with ensemble self-training shows promising improvement toward human-level recognition.
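The iterative correction idea in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: a toy lexicon lookup stands in for the BCN, a fixed-weight average stands in for whatever fusion the paper uses, and the names `LEXICON`, `cloze_predict`, `fuse` and `iterative_correction` are hypothetical. It only shows the loop structure: a noisy vision prediction is re-read by a cloze-style language model that looks at context on both sides of each character, the two are fused, and the fused result is fed back in.

```python
from typing import Dict, List

Dist = Dict[str, float]  # character -> probability at one text position

# Hypothetical toy vocabulary; a real language model would be learned.
LEXICON = ["house", "horse", "mouse"]


def argmax_text(dists: List[Dist]) -> str:
    """Pick the most probable character at every position."""
    return "".join(max(d, key=d.get) for d in dists)


def cloze_predict(text: str, pos: int) -> Dist:
    """Toy cloze step: hide text[pos] and guess it from both sides of context."""
    scores: Dist = {}
    for word in LEXICON:
        if len(word) != len(text):
            continue
        # Count characters that agree left AND right of the masked slot.
        agree = sum(
            1 for i, (a, b) in enumerate(zip(word, text)) if i != pos and a == b
        )
        scores[word[pos]] = scores.get(word[pos], 0.0) + 2.0 ** agree
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else {}


def fuse(vision: Dist, language: Dist, weight: float = 0.5) -> Dist:
    """Assumed fusion rule: fixed-weight average of the two distributions."""
    chars = set(vision) | set(language)
    return {
        c: (1 - weight) * vision.get(c, 0.0) + weight * language.get(c, 0.0)
        for c in chars
    }


def iterative_correction(vision: List[Dist], iterations: int = 3) -> List[str]:
    """Feed the fused prediction back to the language model several times."""
    fused = vision
    history = [argmax_text(vision)]
    for _ in range(iterations):
        text = argmax_text(fused)  # current (possibly noisy) reading
        language = [cloze_predict(text, i) for i in range(len(text))]
        fused = [fuse(v, l) for v, l in zip(vision, language)]
        history.append(argmax_text(fused))
    return history


# A blurry image of "house" where the vision model slightly prefers 'n' over 'u'.
vision_probs = [
    {"h": 0.9, "m": 0.1},
    {"o": 1.0},
    {"n": 0.55, "u": 0.45},
    {"s": 1.0},
    {"e": 1.0},
]
history = iterative_correction(vision_probs)
print(history)  # ['honse', 'house', 'house', 'house']
```

In this toy run the vision-only reading is "honse", and one pass through the cloze step is enough to recover "house". The gradient-blocking and self-training parts of the abstract are not covered here, since they need a trainable model.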
Pages: 7094-7103
Page count: 10