Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding

Cited by: 10
Authors
Zhao, Heng [1 ]
Zhou, Joey Tianyi [1 ]
Ong, Yew-Soon [1 ,2 ]
Affiliations
[1] A*STAR Ctr Frontier AI Res (CFAR), Singapore 138632, Singapore
[2] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
Keywords
Cross-attention; deep learning; multimodal; referring expression comprehension; visual grounding;
DOI
10.1109/TNNLS.2022.3183827
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline classification codes: 081104; 0812; 0835; 1405
Abstract
Current one-stage methods for visual grounding encode the language query as one holistic sentence embedding before fusing it with visual features for target localization. Such a formulation provides insufficient ability to model the query at the word level and is therefore prone to neglecting words that may not be the most important ones for the sentence but are critical for the referred object. In this article, we propose Word2Pix: a one-stage visual grounding network based on the encoder-decoder transformer architecture that learns textual-to-visual feature correspondence via word-to-pixel attention. Each word from the query sentence is given an equal opportunity when attending to visual pixels through multiple stacks of transformer decoder layers. In this way, the decoder can learn to model the language query and fuse language with the visual features for target prediction simultaneously. We conduct experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets, and the proposed Word2Pix outperforms existing one-stage methods by a notable margin. The results also show that Word2Pix surpasses two-stage visual grounding models while keeping the merits of the one-stage paradigm, namely end-to-end training and fast inference speed. Code is available at https://github.com/azurerain7/Word2Pix.
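The abstract describes word-to-pixel cross-attention: word token embeddings act as queries over flattened visual pixel features inside stacked transformer decoder layers, so query modeling and cross-modal fusion happen in the same module. The minimal PyTorch sketch below only illustrates that idea; it is not the authors' released code, and the dimensions, mean-pooling over words, and single-box regression head are illustrative assumptions.

# Minimal sketch (assumptions, not the authors' implementation): word embeddings
# are queries, flattened pixel features are keys/values in stacked decoder layers.
import torch
import torch.nn as nn

class WordToPixelDecoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        # Each decoder layer: self-attention among word tokens, then cross-attention
        # from every word (query) to every visual pixel (key/value).
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.box_head = nn.Linear(d_model, 4)  # hypothetical head: (cx, cy, w, h) in [0, 1]

    def forward(self, word_feats, pixel_feats):
        # word_feats:  (B, num_words, d_model) token embeddings of the query sentence
        # pixel_feats: (B, H*W, d_model)       flattened visual feature map
        fused = self.decoder(tgt=word_feats, memory=pixel_feats)
        # Pool the word-level fused features and regress one referred-object box.
        return self.box_head(fused.mean(dim=1)).sigmoid()

if __name__ == "__main__":
    model = WordToPixelDecoder()
    words = torch.randn(2, 12, 256)        # a 12-word query
    pixels = torch.randn(2, 20 * 20, 256)  # a 20x20 feature map
    print(model(words, pixels).shape)      # torch.Size([2, 4])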
Pages: 1523-1533
Number of pages: 11
Related Papers
50 in total
  • [1] Bidirectional feature fusion via cross-attention transformer for chrysanthemum classification
    Chen, Yifan
    Yang, Xichen
    Yan, Hui
    Liu, Jia
    Jiang, Jian
    Mao, Zhongyuan
    Wang, Tianshu
    PATTERN ANALYSIS AND APPLICATIONS, 2025, 28 (02)
  • [2] Fully Cross-Attention Transformer for Guided Depth Super-Resolution
    Ariav, Ido
    Cohen, Israel
    SENSORS, 2023, 23 (05)
  • [3] Deformable Cross-Attention Transformer for Medical Image Registration
    Chen, Junyu
    Liu, Yihao
    He, Yufan
    Du, Yong
    MACHINE LEARNING IN MEDICAL IMAGING, MLMI 2023, PT I, 2024, 14348 : 115 - 125
  • [4] Unsupervised Domain Adaptation via Bidirectional Cross-Attention Transformer
    Wang, Xiyu
    Guo, Pengxin
    Zhang, Yu
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, ECML PKDD 2023, PT V, 2023, 14173 : 309 - 325
  • [5] Learning Cross-Attention Point Transformer With Global Porous Sampling
    Duan, Yueqi
    Sun, Haowen
    Yan, Juncheng
    Lu, Jiwen
    Zhou, Jie
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 6283 - 6297
  • [6] Spatial-Spectral Transformer With Cross-Attention for Hyperspectral Image Classification
    Peng, Yishu
    Zhang, Yuwen
    Tu, Bing
    Li, Qianming
    Li, Wujing
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [7] Hierarchical cross-modal contextual attention network for visual grounding
    Xin Xu
    Gang Lv
    Yining Sun
    Yuxia Hu
    Fudong Nian
    Multimedia Systems, 2023, 29 : 2073 - 2083
  • [8] Hierarchical cross-modal contextual attention network for visual grounding
    Xu, Xin
    Lv, Gang
    Sun, Yining
    Hu, Yuxia
    Nian, Fudong
    MULTIMEDIA SYSTEMS, 2023, 29 (04) : 2073 - 2083
  • [9] An efficient object tracking based on multi-head cross-attention transformer
    Dai, Jiahai
    Li, Huimin
    Jiang, Shan
    Yang, Hongwei
    EXPERT SYSTEMS, 2025, 42 (02)
  • [10] Vision transformer with feature calibration and selective cross-attention for brain tumor classification
    Mohammad Ali Labbaf Khaniki
    Marzieh Mirzaeibonehkhater
    Mohammad Manthouri
    Elham Hasani
    Iran Journal of Computer Science, 2025, 8 (2) : 335 - 347