GDText-VM: an arbitrary-shaped scene text detector based on globally deformable VMamba

Cited by: 0
Authors
Yingnan Zhao [1 ]
Zheng Hu [1 ]
Fangqi Ding [1 ]
Jielin Jiang [2 ]
Xiaolong Xu [2 ]
Affiliations
[1] Nanjing University of Information Science and Technology, School of Computer Science
[2] Nanjing University of Information Science and Technology, School of Software
Keywords
Computer vision; Globally Deformable VMamba; Attention mechanism; Scene text detection
DOI
10.1007/s40747-025-01987-6
Abstract
Detecting arbitrary-shaped text in natural scenes remains a significant challenge in deep learning research. Text detectors based on Convolutional Neural Networks struggle to model long-range dependencies effectively. Vision Transformers enable global context modeling through self-attention, but their quadratic computational complexity limits practical deployment. To address these challenges, this study proposes GDText-VM (Globally Deformable Text-VMamba), a novel scene text detector built on a deformable VMamba framework that combines a global channel-spatial attention mechanism with Fourier contour modeling. This design captures long-range dependencies with a global receptive field and rapid convergence while retaining linear computational complexity. Unlike the original VMamba, GDText-VM integrates deformable convolutions to strengthen focus on local regions and reduce reliance on cross-shaped activation patterns. In addition, to improve the ability of GDText-VM to fit text contours in the Fourier domain, this study introduces a Global Attention Shuffle Module (GASM) that fuses global channel and spatial features, mitigating the impact of feature imbalance on fitting performance and significantly improving text detection accuracy. Comprehensive experiments on Total-Text, CTW1500, and ICDAR2015 compare GDText-VM with classical scene text detection approaches. The results show that GDText-VM outperforms state-of-the-art methods in precision, recall, and F-measure while maintaining efficient computation, with 25.88M parameters and 40.83G FLOPs.
Notably, GDText-VM achieves F-measure values of 88.5% on Total-Text, 88.9% on CTW1500, and 88.6% on ICDAR2015.
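The abstract's "Fourier contour modeling" refers to representing a closed text boundary by a small set of Fourier coefficients rather than a dense polygon. The paper's exact parameterization (e.g. the Fourier degree k) is not given here, so the following is a minimal numpy sketch of the general idea: sample the contour, treat it as a complex signal, and keep only the 2k+1 lowest frequencies.

```python
import numpy as np

def fit_fourier_contour(points, k=5):
    """Fit a closed contour with 2k+1 Fourier coefficients.

    points: (N, 2) array of (x, y) samples along the contour.
    Returns the DFT spectrum with all but frequencies -k..k zeroed,
    i.e. a compact, smooth approximation of the boundary.
    """
    z = points[:, 0] + 1j * points[:, 1]   # contour as a complex signal
    spectrum = np.fft.fft(z)
    keep = np.zeros(len(z), dtype=bool)
    keep[:k + 1] = True                    # non-negative frequencies 0..k
    keep[-k:] = True                       # negative frequencies -k..-1
    return np.where(keep, spectrum, 0)

def reconstruct_contour(spectrum):
    """Inverse transform the truncated spectrum back to (N, 2) points."""
    z = np.fft.ifft(spectrum)
    return np.stack([z.real, z.imag], axis=1)
```

A smooth shape such as an ellipse only occupies frequencies 0 and ±1, so even a small k reconstructs it exactly; irregular text contours trade a slightly larger k for the same compactness.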
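The abstract describes GASM as fusing global channel and spatial features; its internal structure is not specified here. As a purely illustrative sketch (not the paper's module), the combination could pair two global gates with a ShuffleNet-style channel shuffle so information mixes across channel groups. All function names below are hypothetical:

```python
import numpy as np

def channel_shuffle(x, groups):
    # x: (C, H, W); interleave channels across groups (ShuffleNet-style)
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w).swapaxes(0, 1).reshape(c, h, w)

def _sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def global_attention_shuffle(x, groups=2):
    """Toy fusion of global channel and spatial attention (illustrative only).

    Channel branch: global average pooling -> one gate per channel.
    Spatial branch: mean over channels -> one gate per pixel.
    The two gated features are summed, then channels are shuffled.
    """
    chan_gate = _sigmoid(x.mean(axis=(1, 2)))[:, None, None]  # (C, 1, 1)
    spat_gate = _sigmoid(x.mean(axis=0))[None, :, :]          # (1, H, W)
    fused = x * chan_gate + x * spat_gate
    return channel_shuffle(fused, groups)
```

The shuffle is what lets the channel-wise and pixel-wise statistics interact across groups in the next layer, which is one plausible way to counter the feature imbalance the abstract mentions.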