FA-IATI: A Framework of Frequency Adaptive and Iterative Attention Interaction for Image-Text Matching

Cited by: 1

Authors:

Qin, Youxuan [1]

Zhao, Jing [1]

Li, Ming [2]

Sun, Chao [1]

Affiliations:

[1] Qilu Univ Technol, ShanDong Acad Sci, Sch Comp Sci & Technol, Jinan, Peoples R China

[2] Shandong Univ Tradit Chinese Med, Sch Intelligence & Informat Engn, Jinan, Peoples R China

Source:

2021 International Joint Conference on Neural Networks (IJCNN) | 2021

Funding:

National Key Research and Development Program of China

Keywords:

image-text matching; feature expression; frequency adaptation; attention interaction;

DOI:

10.1109/IJCNN52387.2021.9534069

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The matching relationship between language and vision, which is central to applications such as search engines and social media, is an active research topic. Existing matching methods focus on feature alignment and lack reasoning over high-level semantic concepts within each modality, especially differences in visual expression. We therefore propose FA-IATI, a frequency adaptive and iterative attention interaction framework for image-text matching, which starts from capturing visual semantic relationships. Specifically, we use graph convolutional networks to adaptively aggregate low-frequency and high-frequency signals, enhancing the contextual information between image regions. An attention interaction module generates global features through an iterative mechanism and gradually achieves semantic alignment while aggregating words and image regions. Experiments show that, compared with the baseline model, FA-IATI achieves the best results of 98.4% (R@10) on text query and 94.9% (R@10) on image query on the MS COCO dataset (1K test setting). Compared with other current advanced matching models, FA-IATI delivers superior performance and is strongly competitive.
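The R@10 figures quoted in the abstract are Recall@K scores: the percentage of queries for which at least one correct match appears among the top K retrieved candidates. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code. The function name `recall_at_k`, the score-matrix layout, and the toy data are assumptions made for illustration only.

```python
def recall_at_k(similarity, ground_truth, k):
    """Percentage of queries whose top-k ranked candidates include a correct match.

    similarity[q][c] : score between query q and candidate c (higher is better)
    ground_truth[q]  : set of candidate indices that are correct for query q
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        # A query counts as a hit if any correct candidate is in the top k.
        if any(c in ground_truth[q] for c in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data, made up for illustration: 3 queries scored against 4 candidates.
toy_similarity = [
    [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 is ranked 1st
    [0.2, 0.4, 0.8, 0.5],  # correct candidate 1 is ranked 3rd
    [0.1, 0.2, 0.3, 0.7],  # correct candidate 3 is ranked 1st
]
toy_ground_truth = [{0}, {1}, {3}]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(toy_similarity, toy_ground_truth, k):.1f}%")
```

In this toy setup, raising K can only keep or increase the score, which is why R@10 values are typically much higher than R@1 values for the same model.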


Pages: 8
