Leveraging Text Localization for Scene Text Removal via Text-Aware Masked Image Modeling

Cited by: 0
Authors
Wang, Zixiao [1]
Xie, Hongtao [1]
Wang, YuXin [1]
Qu, Yadong [1]
Guo, Fengjun [2]
Liu, Pengwei [2]
Affiliations
[1] Univ Sci & Technol China, Hefei, Peoples R China
[2] IntSig Informat Co Ltd, Shanghai, Peoples R China
Source
COMPUTER VISION - ECCV 2024, PT LXVI | 2025 / Vol. 15124
Keywords
Scene text removal; Pretraining; Masked image modeling
DOI
10.1007/978-3-031-72848-8_21
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Scene text removal (STR) suffers from insufficient training data because pixel-level labeling is expensive. In this paper, we address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding boxes). Unlike previous pretraining methods, which use indirect auxiliary tasks only to enhance implicit feature extraction, TMIM is the first to enable the STR task to be trained directly in a weakly supervised manner, exploring STR knowledge explicitly and efficiently. TMIM has two streams. First, a Background Modeling stream learns background generation rules by recovering masked non-text regions; it also provides pseudo STR labels on the masked text regions. Second, a Text Erasing stream learns from these pseudo labels and equips the model with end-to-end STR ability. Benefiting from the two collaborative streams, our STR model achieves impressive performance using only public text detection datasets, which greatly alleviates the dependence on high-cost STR labels. Experiments demonstrate that our method outperforms other pretraining methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will be available at https://github.com/wzx99/TMIM.
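The masking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the patch size, the masking ratio, and the rule of skipping any patch that touches text are all assumptions made for illustration. It only shows how text bounding boxes might yield two disjoint masks, one over text regions (where pseudo STR labels would apply) and one over randomly chosen non-text patches (which a background modeling objective would recover).

```python
import random


def text_mask_from_boxes(height, width, boxes):
    """Return a 0/1 grid that is 1 inside any (x0, y0, x1, y1) text box."""
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the image bounds before filling it.
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                mask[y][x] = 1
    return mask


def random_background_mask(text_mask, patch=16, ratio=0.5, seed=0):
    """Mask a random subset of patches that contain no text pixels.

    `patch`, `ratio`, and `seed` are hypothetical parameters, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    height, width = len(text_mask), len(text_mask[0])
    bg_mask = [[0] * width for _ in range(height)]
    for py in range(0, height, patch):
        for px in range(0, width, patch):
            rows = range(py, min(py + patch, height))
            cols = range(px, min(px + patch, width))
            has_text = any(text_mask[y][x] for y in rows for x in cols)
            # Only text-free patches are candidates for background masking.
            if not has_text and rng.random() < ratio:
                for y in rows:
                    for x in cols:
                        bg_mask[y][x] = 1
    return bg_mask


# One text box on a 64x64 image, given as (x0, y0, x1, y1).
text_mask = text_mask_from_boxes(64, 64, [(8, 8, 40, 24)])
bg_mask = random_background_mask(text_mask, patch=16, ratio=0.5, seed=0)
```

Because patches that touch a text box are never selected, the two masks do not overlap, so a reconstruction loss on the background mask would see only non-text pixels.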
Pages: 357-373
Page count: 17