SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition

Cited by: 69

Authors:

Huang, Mingxin ^{[1]}

Liu, Yuliang ^{[2]}

Peng, Zhenghao ^{[2]}

Liu, Chongyu ^{[1]}

Lin, Dahua ^{[2]}

Zhu, Shenggao ^{[3]}

Yuan, Nicholas ^{[3]}

Ding, Kai ^{[4]}

Jin, Lianwen ^{[1,5]}

Affiliations:

[1] South China Univ Technol, Guangzhou, Peoples R China

[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China

[3] Huawei Cloud AI, Shenzhen, Peoples R China

[4] IntSig Informat Co Ltd, Shanghai, Peoples R China

[5] Peng Cheng Lab, Shenzhen, Peoples R China

Source:

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022

Keywords:

DOI:

10.1109/CVPR52688.2022.00455

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

End-to-end scene text spotting has attracted great attention in recent years due to the success of exploiting the intrinsic synergy between scene text detection and recognition. However, recent state-of-the-art methods usually combine detection and recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end scene text spotting framework termed SwinTextSpotter. Using a transformer encoder with a dynamic head as the detector, we unify the two tasks with a novel Recognition Conversion mechanism that explicitly guides text localization through the recognition loss. The straightforward design results in a concise framework that requires neither an additional rectification module nor character-level annotation for arbitrarily-shaped text. Qualitative and quantitative experiments on the multi-oriented datasets RoIC13 and ICDAR 2015, the arbitrarily-shaped datasets Total-Text and CTW1500, and the multi-lingual datasets ReCTS (Chinese) and VinText (Vietnamese) demonstrate that SwinTextSpotter significantly outperforms existing methods. Code is available at https://github.com/mxin262/SwinTextSpotter.

Pages: 4583-4593

Page count: 11
