Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text Image Super-Resolution

Cited by: 6
Authors
Zhang, Wenyu [1 ]
Deng, Xin [1 ]
Jia, Baojun [1 ]
Yu, Xingtong [1 ]
Chen, Yifan [2 ]
Ma, Jin [2 ]
Ding, Qing [1 ]
Zhang, Xinming [1 ]
Affiliations
[1] Univ Sci & Technol China, Hefei, Peoples R China
[2] China Merchants Bank, Chengdu, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Keywords
scene text image super-resolution; vision backbone; pixel-wise graph attention; RECOGNITION; NETWORK;
DOI
10.1145/3581783.3611913
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Current scene text image super-resolution approaches focus primarily on extracting robust features, acquiring text information, and designing complex training strategies to generate super-resolution images. However, the upsampling module, which is crucial for converting low-resolution images to high-resolution ones, has received little attention in existing works. To address this gap, we propose the Pixel Adapter Module (PAM), a graph-attention-based module that corrects the pixel distortion caused by upsampling. The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update its features. Unlike previous graph attention mechanisms, our approach improves efficiency and memory utilization by 2-3 orders of magnitude by eliminating the dependency on sparse adjacency matrices and introducing a sliding-window scheme for efficient parallel computation. Additionally, we introduce the MLP-based Sequential Residual Block (MSRB) for robust feature extraction from text images, and a Local Contour Awareness loss (L-lca) to enhance the model's perception of fine details. Comprehensive experiments on TextZoom demonstrate that our method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy. With the single-stage and multi-stage training strategies, we achieve improvements of 0.7% and 2.6%, raising performance from 52.6% and 53.7% to 53.3% and 56.3%, respectively. The code is available at https://github.com/wenyu1009/RTSRN.
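The core idea of the abstract — pixel-wise graph attention over a local sliding window, avoiding a sparse adjacency matrix — can be sketched as below. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the function name, the dot-product scoring, and the 3x3 window are choices made here for demonstration only.

```python
import numpy as np

def pixelwise_graph_attention(feat, k=3):
    """Illustrative sliding-window pixel-wise graph attention.

    Each pixel attends over its k x k neighborhood; because every
    window has the same shape, all pixels are processed in parallel
    with dense tensor ops and no sparse adjacency matrix is needed.
    feat: (H, W, C) feature map. Returns an (H, W, C) array.
    """
    H, W, C = feat.shape
    pad = k // 2
    padded = np.pad(feat, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    # Gather the k*k neighbors of every pixel: shape (H, W, k*k, C).
    neighbors = np.stack(
        [padded[i:i + H, j:j + W] for i in range(k) for j in range(k)],
        axis=2,
    )
    # Scaled dot-product score between each center pixel and its neighbors.
    center = feat[:, :, None, :]                        # (H, W, 1, C)
    scores = (center * neighbors).sum(-1) / np.sqrt(C)  # (H, W, k*k)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over window
    # Aggregate neighbor features with the attention weights.
    return (weights[..., None] * neighbors).sum(axis=2)

x = np.random.rand(8, 8, 4).astype(np.float32)
y = pixelwise_graph_attention(x, k=3)
print(y.shape)  # (8, 8, 4)
```

The actual PAM would learn query/key/value projections; the sketch only shows why the fixed-size window removes the need for per-pixel adjacency lists and allows batched computation.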
Pages: 2168-2179
Page count: 12