SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition

Cited by: 69
Authors
Huang, Mingxin [1]
Liu, Yuliang [2]
Peng, Zhenghao [2]
Liu, Chongyu [1]
Lin, Dahua [2]
Zhu, Shenggao [3]
Yuan, Nicholas [3]
Ding, Kai [4]
Jin, Lianwen [1, 5]
Affiliations
[1] South China Univ Technol, Guangzhou, Peoples R China
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[3] Huawei Cloud AI, Shenzhen, Peoples R China
[4] IntSig Informat Co Ltd, Shanghai, Peoples R China
[5] Peng Cheng Lab, Shenzhen, Peoples R China
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
DOI
10.1109/CVPR52688.2022.00455
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
End-to-end scene text spotting has attracted great attention in recent years due to the success of exploiting the intrinsic synergy between scene text detection and recognition. However, recent state-of-the-art methods usually combine detection and recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end scene text spotting framework termed SwinTextSpotter. Using a transformer encoder with a dynamic head as the detector, we unify the two tasks with a novel Recognition Conversion mechanism that explicitly guides text localization through the recognition loss. The straightforward design results in a concise framework that requires neither an additional rectification module nor character-level annotation for arbitrarily-shaped text. Qualitative and quantitative experiments on the multi-oriented datasets RoIC13 and ICDAR 2015, the arbitrarily-shaped datasets Total-Text and CTW1500, and the multi-lingual datasets ReCTS (Chinese) and VinText (Vietnamese) demonstrate that SwinTextSpotter significantly outperforms existing methods. Code is available at https://github.com/mxin262/SwinTextSpotter.
Pages: 4583-4593
Page count: 11