Transforming Scene Text Detection and Recognition: A Multi-Scale End-to-End Approach With Transformer Framework

Cited by: 0

Authors:

Geng, Tianyu [1]

Affiliations:

[1] Nanjing Tech Univ, Coll Artificial Intelligence, Coll Comp & Informat Engn, Nanjing 211816, Jiangsu, Peoples R China

Source:

IEEE ACCESS | 2024 / Vol. 12

Keywords:

Text recognition; transformer; end-to-end; multi-scale

DOI:

10.1109/ACCESS.2024.3375497

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Text is an essential means for humans to acquire information and engage in social communication. Accurate text extraction from images is crucial for scene understanding and for many real-world tasks. However, text detection and recognition in natural scenes are challenged by image noise, irregularly distributed text fonts, diverse text formats, complex backgrounds, and image quality degraded by difficult acquisition conditions. These factors severely reduce recognition accuracy and remain pressing open problems in the field. To address them, this paper proposes a transformer-based algorithm for scene text detection and recognition within a multi-scale end-to-end framework. First, integrating the detection and recognition stages into a single end-to-end framework simplifies the process, reducing computation and errors. Second, multi-scale features are incorporated to capture text information at various scales; fusing these features improves recognition accuracy, robustness, and resistance to interference. Finally, the transformer framework handles text information at different scales and positions, improving generalization: its self-attention mechanism, stacked multi-layer structure, and positional encoding contribute to its effectiveness in processing diverse text information. In validation, the proposed method shows improved efficiency in scene text detection and recognition.


Pages: 40582-40596

Page count: 15
