Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition

Cited by: 253

Authors:

Fang, Shancheng [1]

Xie, Hongtao [1]

Wang, Yuxin [1]

Mao, Zhendong [1]

Zhang, Yongdong [1]

Affiliations:

[1] University of Science and Technology of China, Hefei, Anhui, People's Republic of China

Source:

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021

DOI:

10.1109/CVPR46437.2021.00702

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Linguistic knowledge is of great benefit to scene text recognition. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from: 1) implicitly language modeling; 2) unidirectional feature representation; and 3) language model with noise input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet for scene text recognition. Firstly, the autonomous suggests to block gradient flow between vision and language models to enforce explicitly language modeling. Secondly, a novel bidirectional cloze network (BCN) as the language model is proposed based on bidirectional feature representation. Thirdly, we propose an execution manner of iterative correction for language model which can effectively alleviate the impact of noise input. Additionally, based on the ensemble of iterative predictions, we propose a self-training method which can learn from unlabeled images effectively. Extensive experiments indicate that ABINet has superiority on low-quality images and achieves state-of-the-art results on several mainstream benchmarks. Besides, the ABINet trained with ensemble self-training shows promising improvement in realizing human-level recognition.

Pages: 7094-7103

Page count: 10
