TextDiff: Enhancing scene text image super-resolution with mask-guided residual diffusion models

Times Cited: 0
Authors
Liu, Baolin [1 ]
Yang, Zongyuan [1 ]
Chiu, Chinwai [1 ]
Xiong, Yongping [1 ]
Affiliations
[1] Beijing University of Posts and Telecommunications, State Key Laboratory of Switching and Networking Technology, Beijing 100876, People's Republic of China
Keywords
Scene text image super-resolution; Text enhancement; Diffusion model; Multi-stage learning; Model expandability
DOI
10.1016/j.patcog.2025.111513
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The goal of scene text image super-resolution (STISR) is to reconstruct high-resolution text-line images from unrecognizable low-resolution inputs. Existing methods that rely on optimizing pixel-level losses tend to produce noticeably blurred text edges, which substantially degrades both the readability and the recognizability of the text. To address these issues, we propose TextDiff, the first diffusion-based framework tailored for STISR. It contains two modules: the Text Enhancement Module (TEM) and the Mask-Guided Residual Diffusion Module (MRD). The TEM generates an initial deblurred text image together with a mask that encodes the spatial location of the text. The MRD effectively sharpens the text edges by modeling the residuals between the ground-truth images and the initial deblurred images. Extensive experiments demonstrate that TextDiff achieves state-of-the-art (SOTA) performance on public benchmark datasets, with a maximum improvement of 2.0% in recognition accuracy over existing methods, while enhancing the readability of scene text images. Moreover, the proposed MRD module is plug-and-play and effectively sharpens the text edges produced by SOTA methods, improving the readability and recognizability of their results without requiring any additional joint training.
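For illustration only, the following is a minimal Python/PyTorch-style sketch of the two-stage pipeline described in the abstract: a first stage produces a coarse deblurred image and a text mask, and a second, diffusion-based stage generates the residual to the ground truth. All names (TextDiffSketch, the sample routine, the conditioning scheme) are assumptions for exposition, not the authors' released implementation.

    # Hypothetical sketch of the TEM + MRD pipeline; module internals and the
    # reverse-diffusion sampler are placeholders, not the paper's code.
    import torch
    import torch.nn as nn

    class TextDiffSketch(nn.Module):
        def __init__(self, tem: nn.Module, mrd: nn.Module):
            super().__init__()
            self.tem = tem  # Text Enhancement Module: LR image -> coarse deblurred image + text mask
            self.mrd = mrd  # Mask-Guided Residual Diffusion Module: models the residual to the ground truth

        def training_targets(self, lr_img: torch.Tensor, hr_img: torch.Tensor):
            # Stage 1: coarse prediction and text-location mask from the LR input.
            coarse, text_mask = self.tem(lr_img)
            # Stage 2 target: the residual between the ground truth and the coarse
            # prediction, which the diffusion module learns to generate.
            residual = hr_img - coarse
            return coarse, text_mask, residual

        @torch.no_grad()
        def super_resolve(self, lr_img: torch.Tensor) -> torch.Tensor:
            coarse, text_mask = self.tem(lr_img)
            # `sample` stands in for the (unspecified) reverse-diffusion sampler,
            # conditioned here on the coarse image and the text mask.
            residual = self.mrd.sample(condition=torch.cat([coarse, text_mask], dim=1))
            return (coarse + residual).clamp(0.0, 1.0)

Because the diffusion stage only has to generate a residual rather than the full image, it can be attached to the coarse output of other super-resolution models, which is how the abstract's plug-and-play claim should be read.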
Pages: 14