Reservoir Computing Transformer for Image-Text Retrieval

Cited by: 3
Authors
Li, Wenrui [1 ]
Ma, Zhengyu [2 ]
Deng, Liang-Jian [3 ]
Wang, Penghong [1 ]
Shi, Jinqiao [4 ]
Fan, Xiaopeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Univ Elect Sci & Technol China, Chengdu, Sichuan, Peoples R China
[4] Beijing Univ Posts & Telecommun, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
image-text retrieval; reservoir computing; transformer;
DOI
10.1145/3581783.3611758
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Although the attention mechanism in transformers has proven successful in image-text retrieval tasks, most transformer models suffer from a large number of parameters. Inspired by brain circuits that process information with recurrently connected neurons, we propose a novel Reservoir Computing Transformer Reasoning Network (RCTRN) for image-text retrieval. The proposed RCTRN employs a two-step strategy that addresses the feature representation and the data distribution of the different modalities respectively. Specifically, we send visual and textual features through a unified meshed reasoning module, which encodes multi-level feature relationships with prior knowledge and aggregates the complementary outputs more effectively. A reservoir reasoning network (RRN) is proposed to optimize memory connections between features at different stages and to address the data-distribution mismatch introduced by the unified scheme. To investigate the significance of the RRN's low power dissipation and low bandwidth in practical scenarios, we deployed the model in a wireless transmission system, demonstrating that the RRN's optimization of data structures also provides a degree of robustness against channel noise. Extensive experiments on two benchmark datasets, Flickr30K and MS-COCO, demonstrate the superiority of RCTRN over state-of-the-art baselines in terms of performance and low power dissipation.
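For readers unfamiliar with reservoir computing, the core idea the abstract builds on is a fixed, recurrently connected state that is driven by inputs while only a lightweight readout is trained. The sketch below shows a generic leaky echo-state-network update; it is a minimal illustration of reservoir computing in general, not the paper's RCTRN or RRN implementation, and all sizes and names (`n_in`, `n_res`, `leak`) are illustrative assumptions.

```python
import numpy as np

# Generic reservoir (echo state network) update -- an illustrative sketch,
# NOT the RCTRN/RRN architecture from the paper.
rng = np.random.default_rng(0)
n_in, n_res = 8, 64                               # assumed toy dimensions

W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))      # fixed (untrained) input weights
W = rng.uniform(-0.5, 0.5, (n_res, n_res))        # fixed recurrent weights
W *= 0.9 / max(abs(np.linalg.eigvals(W)))         # scale spectral radius below 1

def step(state, x, leak=0.3):
    """Leaky-integrator reservoir update; only a readout on `state` is trained."""
    pre = W_in @ x + W @ state
    return (1.0 - leak) * state + leak * np.tanh(pre)

state = np.zeros(n_res)
for _ in range(10):                               # drive the reservoir with inputs
    state = step(state, rng.standard_normal(n_in))
```

Because the recurrent weights stay fixed, such a reservoir adds recurrence without adding trainable parameters, which is the property the abstract invokes when motivating low power dissipation.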
Pages: 5605-5613
Number of pages: 9