CDistNet: Perceiving Multi-domain Character Distance for Robust Text Recognition

Cited by: 18
Authors
Zheng, Tianlun [1 ,2 ]
Chen, Zhineng [1 ,2 ]
Fang, Shancheng [3 ]
Xie, Hongtao [3 ]
Jiang, Yu-Gang [1 ,2 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai 200438, Peoples R China
[2] Fudan Univ, Shanghai Collaborat Innovat Ctr Intelligent Visual, Shanghai 200438, Peoples R China
[3] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230026, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Scene text recognition; Attention mechanism; Position embedding; Character distance; SCENE; NETWORK;
DOI
10.1007/s11263-023-01880-0
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
The transformer-based encoder-decoder framework is becoming popular in scene text recognition, largely because it naturally integrates recognition clues from both the visual and semantic domains. However, recent studies show that the two kinds of clues are not always well registered, so features and characters may be misaligned in difficult text (e.g., text with a rare shape). As a result, constraints such as character position have been introduced to alleviate this problem. Despite some success, the visual and semantic clues are still modeled separately and only loosely associated. In this paper, we propose a novel module, multi-domain character distance perception (MDCDP), to establish a visually and semantically related position embedding. MDCDP uses the position embedding to query both visual and semantic features following the cross-attention mechanism. The two kinds of clues are fused into the position branch, generating a content-aware embedding that well perceives character spacing and orientation variations, character semantic affinities, and clues tying the two kinds of information together; we summarize these as the multi-domain character distance. We develop CDistNet, which stacks multiple MDCDPs to guide increasingly precise distance modeling. The feature-character alignment is thus well established even when various recognition difficulties are present. We verify CDistNet on ten challenging public datasets and two series of augmented datasets that we created. The experiments demonstrate that CDistNet performs highly competitively: it not only ranks top-tier on standard benchmarks but also outperforms recent popular methods by clear margins on real and augmented datasets exhibiting severe text deformation, poor linguistic support, and rare character layouts. In addition, visualizations show that CDistNet makes proper use of information in both the visual and semantic domains. Our code is available at https://github.com/simplify23/CDistNet.
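The abstract describes the MDCDP architecture at a high level: a position embedding queries visual and semantic features via cross-attention, the two results are fused into the position branch, and CDistNet stacks several such blocks. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the module name, dimensions, and concatenation-based fusion are illustrative assumptions, and the official code at https://github.com/simplify23/CDistNet is the authoritative reference.

import torch
import torch.nn as nn


class MDCDPSketch(nn.Module):
    """Sketch of one MDCDP block: the position embedding queries the visual
    and semantic domains via cross-attention, and both results are fused
    back into the position branch (fusion scheme is an assumption)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sem_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)  # assumed fusion; gating is another option
        self.norm = nn.LayerNorm(d_model)

    def forward(self, pos_emb, vis_feat, sem_feat):
        # pos_emb:  (B, T, D) character position embedding (the query branch)
        # vis_feat: (B, N, D) visual features from the image encoder
        # sem_feat: (B, T, D) semantic features from previously decoded characters
        q, _ = self.self_attn(pos_emb, pos_emb, pos_emb)  # refine the position query
        v, _ = self.vis_attn(q, vis_feat, vis_feat)       # position queries the visual domain
        s, _ = self.sem_attn(q, sem_feat, sem_feat)       # position queries the semantic domain
        fused = self.fuse(torch.cat([v, s], dim=-1))      # fuse both domains into the position branch
        return self.norm(q + fused)                       # content-aware position embedding


# Stacking several MDCDP blocks, as CDistNet does, gradually refines the
# distance-aware embedding (dummy tensors for illustration only).
blocks = nn.ModuleList([MDCDPSketch() for _ in range(3)])
pos = torch.randn(2, 25, 512)   # position embedding
vis = torch.randn(2, 64, 512)   # visual features
sem = torch.randn(2, 25, 512)   # semantic features
for blk in blocks:
    pos = blk(pos, vis, sem)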
Pages: 300-318
Number of pages: 19