TextDiff: Enhancing scene text image super-resolution with mask-guided residual diffusion models

Cited by: 0
Authors
Liu, Baolin [1 ]
Yang, Zongyuan [1 ]
Chiu, Chinwai [1 ]
Xiong, Yongping [1 ]
Affiliations
[1] Beijing Univ Post & Telecommun, State Key Lab Switching & Networking Technol, Beijing 100876, Peoples R China
Keywords
Scene text image super-resolution; Text enhancement; Diffusion model; Multi-stage learning; Model expandability
DOI
10.1016/j.patcog.2025.111513
CLC Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The goal of scene text image super-resolution (STISR) is to reconstruct high-resolution text-line images from unrecognizable low-resolution inputs. Existing methods that rely on optimizing pixel-level losses tend to produce noticeably blurred text edges, which substantially degrades both the readability and the recognizability of the text. To address these issues, we propose TextDiff, the first diffusion-based framework tailored for STISR. It contains two modules: the Text Enhancement Module (TEM) and the Mask-Guided Residual Diffusion Module (MRD). The TEM generates an initial deblurred text image and a mask that encodes the spatial location of the text. The MRD sharpens the text edges by modeling the residuals between the ground-truth images and the initial deblurred images. Extensive experiments demonstrate that TextDiff achieves state-of-the-art (SOTA) performance on public benchmark datasets, with a maximum improvement of 2.0% in recognition accuracy over existing methods, while enhancing the readability of scene text images. Moreover, the proposed MRD module is a plug-and-play component that effectively sharpens the text edges produced by SOTA methods; this improves the readability and recognizability of their results without requiring any additional joint training.
Pages: 14
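
The abstract describes a two-stage pipeline: a TEM that predicts an initial deblurred image plus a text-location mask, followed by a mask-guided diffusion model over the residual to the ground truth. Below is a minimal PyTorch-style sketch of that flow. The simple convolutional stand-ins for TEM and MRD, the channel-concatenation conditioning, and the DDPM-style noise schedule are illustrative assumptions, not the authors' architecture.

# Minimal sketch of the TextDiff two-stage flow described in the abstract.
# All layer shapes and the diffusion schedule are illustrative assumptions.
import torch
import torch.nn as nn


class TEM(nn.Module):
    """Text Enhancement Module (sketch): initial deblurred image + text mask."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.to_image = nn.Conv2d(channels, 3, 3, padding=1)  # initial deblurred RGB image
        self.to_mask = nn.Conv2d(channels, 1, 3, padding=1)   # spatial text-location mask

    def forward(self, lr_up: torch.Tensor):
        feats = self.backbone(lr_up)
        return self.to_image(feats), torch.sigmoid(self.to_mask(feats))


class MRD(nn.Module):
    """Mask-Guided Residual Diffusion (sketch): denoises the residual
    (HR minus initial estimate), conditioned on the estimate and the mask."""

    def __init__(self, channels: int = 64, timesteps: int = 1000):
        super().__init__()
        self.timesteps = timesteps
        # Input channels: noisy residual (3) + initial estimate (3) + mask (1) = 7.
        self.denoiser = nn.Sequential(
            nn.Conv2d(7, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 3, 3, padding=1),
        )
        betas = torch.linspace(1e-4, 2e-2, timesteps)
        self.register_buffer("alphas_cumprod", torch.cumprod(1.0 - betas, dim=0))

    def forward(self, residual, init_img, mask, t):
        # Standard DDPM forward noising of the residual, then noise prediction.
        a_bar = self.alphas_cumprod[t].view(-1, 1, 1, 1)
        noise = torch.randn_like(residual)
        noisy_res = a_bar.sqrt() * residual + (1.0 - a_bar).sqrt() * noise
        pred_noise = self.denoiser(torch.cat([noisy_res, init_img, mask], dim=1))
        return pred_noise, noise


if __name__ == "__main__":
    lr_up = torch.randn(2, 3, 32, 128)  # upsampled low-resolution text-line images
    hr = torch.randn(2, 3, 32, 128)     # ground-truth high-resolution images
    tem, mrd = TEM(), MRD()
    init_img, mask = tem(lr_up)
    t = torch.randint(0, mrd.timesteps, (2,))
    pred_noise, noise = mrd(hr - init_img, init_img, mask, t)
    loss = nn.functional.mse_loss(pred_noise, noise)  # simple epsilon-prediction loss
    print(loss.item())

Because the diffusion stage operates only on the residual of an already-plausible estimate, the same sampler can in principle be attached to outputs of other super-resolution models, which is consistent with the plug-and-play use of MRD described above.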