Combining Swin Transformer and Attention-Weighted Fusion for Scene Text Detection

Cited by: 4
Authors
Li, Xianguo [1 ,2 ]
Yao, Xingchen [1 ,2 ]
Liu, Yi [2 ,3 ]
Affiliations
[1] Tiangong Univ, Sch Elect & Informat Engn, Tianjin 300387, Peoples R China
[2] Tianjin Key Lab Optoelect Detect Technol & Syst, Tianjin 300387, Peoples R China
[3] Tiangong Univ, Ctr Engn Internship & Training, Tianjin 300387, Peoples R China
Keywords
Scene text detection; Swin Transformer; Attention-weighted fusion; Global feature perception
DOI
10.1007/s11063-024-11501-7
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Existing text detection algorithms based on Convolutional Neural Networks (CNNs) commonly suffer from insufficient receptive fields and inadequate extraction of spatial positional information, which limits their ability to detect text instances with large scale variations and long or widely spaced text instances, and to distinguish text from complex background textures. To address these problems, this paper proposes a scene text detection algorithm that combines the Swin Transformer with attention-weighted fusion. First, an attention-weighted fusion (AWF) module is proposed, which embeds a modified coordinate attention module (CAM) into the feature pyramid network (FPN). The module learns spatial positional weights for foreground information across feature scales while suppressing redundant background information, so the fused features focus more sharply on text regions and localize text regions and boundaries more accurately. Second, the window-based self-attention mechanism of the Swin Transformer is applied to the fused pyramid features to achieve global feature perception. This compensates for the limited receptive fields of CNNs and strengthens the representation of global contextual features, further improving detection performance. Experimental results demonstrate that the proposed algorithm achieves competitive performance on three public datasets, ICDAR2015, MSRA-TD500, and Total-Text, with F-measures of 87.9%, 91.4%, and 86.7%, respectively. Code is available at: https://github.com/xgli411/ST-AWFNet.
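To make the fusion step concrete, the following is a minimal PyTorch sketch of how a coordinate-attention-style module could re-weight a lateral FPN feature before it is merged with the upsampled top-down feature, in the spirit of the AWF module described in the abstract. Class and parameter names (CoordAttention, AWFBlock, reduction) are illustrative assumptions rather than the authors' implementation, and the Swin Transformer stage that follows the fusion is omitted; the official code at https://github.com/xgli411/ST-AWFNet is the authoritative reference.

    # Illustrative sketch only: names and hyperparameters are assumptions,
    # not the authors' released implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CoordAttention(nn.Module):
        """Simplified coordinate attention (Hou et al., CVPR 2021)."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            mid = max(8, channels // reduction)
            self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
            self.bn1 = nn.BatchNorm2d(mid)
            self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
            self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

        def forward(self, x):
            n, c, h, w = x.shape
            # Pool along each spatial axis separately to retain positional cues.
            x_h = F.adaptive_avg_pool2d(x, (h, 1))                      # N x C x H x 1
            x_w = F.adaptive_avg_pool2d(x, (1, w)).permute(0, 1, 3, 2)  # N x C x W x 1
            y = torch.relu(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
            y_h, y_w = torch.split(y, [h, w], dim=2)
            a_h = torch.sigmoid(self.conv_h(y_h))                       # N x C x H x 1
            a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # N x C x 1 x W
            return x * a_h * a_w  # position-aware re-weighting of the input feature

    class AWFBlock(nn.Module):
        """Fuse a lateral FPN feature with the upsampled top-down feature,
        weighting the lateral branch with coordinate attention first."""
        def __init__(self, channels):
            super().__init__()
            self.cam = CoordAttention(channels)
            self.smooth = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, lateral, top_down):
            top_down = F.interpolate(top_down, size=lateral.shape[-2:], mode="nearest")
            return self.smooth(self.cam(lateral) + top_down)

    # Usage: fuse a 1/8-scale lateral map with a 1/16-scale top-down map.
    p3 = AWFBlock(256)(torch.randn(1, 256, 80, 80), torch.randn(1, 256, 40, 40))
    print(p3.shape)  # torch.Size([1, 256, 80, 80])

The point the sketch illustrates is that the attention weights are factorized along the height and width axes, so the re-weighting preserves spatial positional information that plain channel attention would discard.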
Pages: 22