MASTER: Multi-aspect non-local network for scene text recognition

Cited by: 120
Authors
Lu, Ning [1]
Yu, Wenwen [1, 2]
Qi, Xianbiao [1]
Chen, Yihao [1]
Gong, Ping [2]
Xiao, Rong [1]
Bai, Xiang [3]
Affiliations
[1] Ping An Property & Casualty Insurance Co, Visual Comp Grp, Shenzhen, Peoples R China
[2] Xuzhou Med Univ, Sch Med Imaging, 209 Tongshan Rd, Xuzhou 221000, Jiangsu, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan, Peoples R China
Keywords
Scene text recognition; Transformer; Non-local network; Memory-cached mechanism; Attention network
DOI
10.1016/j.patcog.2021.107980
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Attention-based scene text recognizers have achieved great success; they leverage a more compact intermediate representation to learn 1D or 2D attention with an RNN-based encoder-decoder architecture. However, such methods suffer from the attention-drift problem, because high similarity among encoded features leads to attention confusion under the RNN-based local attention mechanism. Moreover, RNN-based methods have low efficiency due to poor parallelization. To overcome these problems, we propose MASTER, a self-attention-based scene text recognizer that (1) not only encodes the input-output attention but also learns self-attention, which encodes feature-feature and target-target relationships inside the encoder and decoder, (2) learns an intermediate representation that is more powerful and more robust to spatial distortion, and (3) offers high training efficiency thanks to highly parallel training and fast inference thanks to an efficient memory-cache mechanism. Extensive experiments on various benchmarks demonstrate the superior performance of MASTER on both regular and irregular scene text. PyTorch code can be found at https://github.com/wenwenyu/MASTER-pytorch, and TensorFlow code can be found at https://github.com/jiangxiluning/MASTER-TF. (c) 2021 Elsevier Ltd. All rights reserved.
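The abstract does not spell out how the memory-cache mechanism works. As a rough illustration only, the sketch below assumes it resembles the key/value caching commonly used in self-attention decoders: keys and values of earlier decoding steps are stored, so each new step projects only the newest input instead of the whole prefix. Every name here (`CachedDecoderAttention`, `decode_without_cache`, the toy matrices `W_K` and `W_V`) is hypothetical and not taken from the MASTER code, and the query projection is omitted for brevity.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * val[i] for w, val in zip(weights, values))
            for i in range(len(values[0]))]


class CachedDecoderAttention:
    """Assumed cache design: keep projected keys/values of past steps."""

    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.keys, self.values = [], []
        self.projections = 0  # number of key/value projections computed

    def step(self, x):
        # Project only the newest input; earlier steps come from the cache.
        self.keys.append(matvec(self.w_k, x))
        self.values.append(matvec(self.w_v, x))
        self.projections += 2
        # Simplification: the raw input doubles as the query.
        return attend(x, self.keys, self.values)


def decode_without_cache(inputs, w_k, w_v):
    """Baseline: re-project the entire prefix at every decoding step."""
    outputs, projections = [], 0
    for t in range(len(inputs)):
        prefix = inputs[: t + 1]
        keys = [matvec(w_k, x) for x in prefix]
        values = [matvec(w_v, x) for x in prefix]
        projections += 2 * len(prefix)
        outputs.append(attend(prefix[-1], keys, values))
    return outputs, projections


# Toy 2-d inputs and projection matrices (arbitrary illustrative values).
W_K = [[1.0, 0.5], [0.0, 1.0]]
W_V = [[0.5, 0.0], [0.25, 1.0]]
INPUTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cached = CachedDecoderAttention(W_K, W_V)
cached_outputs = [cached.step(x) for x in INPUTS]
plain_outputs, plain_projections = decode_without_cache(INPUTS, W_K, W_V)
```

Both decoding paths yield the same attention outputs, but the number of projections grows linearly with sequence length when cached and quadratically when the prefix is recomputed, which is the kind of saving a cache of this sort would give at inference time.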
Pages: 10