Hierarchical Feature Aggregation Based on Transformer for Image-Text Matching

Cited by: 37
Authors
Dong, Xinfeng [1 ]
Zhang, Huaxiang [1 ,2 ]
Zhu, Lei [1 ]
Nie, Liqiang [3 ]
Liu, Li [1 ]
Affiliations
[1] Shandong Normal Univ, Sch Informat Sci & Engn, Jinan 250358, Peoples R China
[2] Shandong Jiaotong Univ, Sch Informat Sci & Elect Engn, Jinan 250357, Peoples R China
[3] Shandong Univ, Sch Comp Sci & Technol, Jinan 250100, Shandong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Semantics; Feature extraction; Bit error rate; Visualization; Image reconstruction; Correlation; Image-text alignment; graph convolutional network; transformer model; GRAPH ATTENTION; NETWORK;
DOI
10.1109/TCSVT.2022.3164230
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
To carry out more accurate retrieval across image and text modalities, some scholars use fine-grained features to align images and text. Most of them directly apply an attention mechanism to align image regions with words in the sentence, ignoring the fact that the semantics of an object are abstract and cannot be accurately expressed by object information alone. To overcome this weakness, we propose a hierarchical feature aggregation algorithm based on graph convolutional networks (GCN) that preserves object semantic integrity by hierarchically integrating the attributes of an object and the relations between objects in both the image and text modalities. To eliminate the semantic gap between modalities, we propose a transformer-based cross-modal feature fusion method that generates modality-specific feature representations by integrating the object features with the global feature from the other modality. We then map the fused features into a common space. Experimental results on the most frequently used datasets, MSCOCO and Flickr30K, show the effectiveness of the proposed model compared with the latest methods.
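The abstract's hierarchical aggregation step rests on standard graph convolution: each object node updates its representation by mixing in the features of its neighbors (attributes and related objects). Below is a minimal numpy sketch of one such GCN propagation layer over region features; the graph shape, feature sizes, and function name are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def gcn_layer(features, adj, weight):
    """One graph-convolution step: aggregate neighbor features.

    features: (n_nodes, d_in)  node features (e.g., image-region features)
    adj:      (n_nodes, n_nodes) adjacency matrix including self-loops
    weight:   (d_in, d_out)    learnable projection (random here for the sketch)
    """
    # Symmetric normalization D^{-1/2} A D^{-1/2}, as in Kipf & Welling's GCN
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    adj_norm = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Propagate and project, then apply ReLU
    return np.maximum(adj_norm @ features @ weight, 0.0)

# Toy example: 3 object nodes with 4-dim features, projected to 2 dims.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
adj = np.eye(3) + np.array([[0, 1, 0],
                            [1, 0, 1],
                            [0, 1, 0]])  # chain graph plus self-loops
W = rng.normal(size=(4, 2))
out = gcn_layer(feats, adj, W)
print(out.shape)  # (3, 2)
```

Stacking such layers over an object graph (objects, their attributes, and inter-object relations) yields the hierarchical aggregation the abstract describes; the transformer fusion stage would then cross-attend these node features against the global feature of the other modality.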
Pages: 6437-6447
Page count: 11