Learning a Mixture of Conditional Gating Blocks for Visual Question Answering

Cited by: 1
Authors
Sun, Qiang [1 ,2 ]
Fu, Yan-Wei [3 ]
Xue, Xiang-Yang [4 ]
Affiliations
[1] Shanghai Univ Int Business & Econ, Sch Stat & Informat, Shanghai 201620, Peoples R China
[2] Fudan Univ, Acad Engn & Technol, Shanghai 200433, Peoples R China
[3] Fudan Univ, Sch Data Sci, Shanghai 200433, Peoples R China
[4] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
visual question answering; Transformer; dynamic network;
DOI
10.1007/s11390-024-2113-0
CLC Number
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
As a Turing test in multimedia, visual question answering (VQA) aims to answer a textual question about a given image. Recently, the "dynamic" property of neural networks has been explored as one of the most promising ways to improve the adaptability, interpretability, and capacity of neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, exploiting dynamics in the Transformers of VQA models through all stages in an end-to-end manner remains relatively untouched and highly nontrivial. Typically, owing to the large computational cost of Transformers, researchers tend to apply Transformers only to the extracted high-level visual features of downstream vision-and-language tasks. To this end, we introduce a question-guided dynamic layer into the Transformer, which effectively increases model capacity and requires fewer Transformer layers for the VQA task. In particular, we realize the dynamics in the Transformer as a Conditional Multi-Head Self-Attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with the conditional ResNeXt block (cResNeXt). We therefore propose a novel Mixture of Conditional Gating blocks (McG) model for VQA, which keeps the best of the Transformer, the convolutional neural network (CNN), and dynamic networks. A pure conditional gating CNN model and a pure conditional gating Transformer model can both be viewed as special cases of McG. We evaluate McG quantitatively and qualitatively on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG achieves state-of-the-art performance on these benchmarks.
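To make the conditional gating idea concrete, below is a minimal sketch of a question-guided multi-head self-attention block in the spirit of the cMHSA described in the abstract. The record does not give the paper's exact formulation, so everything here is an illustrative assumption: the class name ConditionalMHSA is hypothetical, and the gating design (per-head sigmoid gates predicted from a question embedding and applied to the head outputs before the output projection) is one plausible reading, written in PyTorch, not the authors' implementation.

import math
import torch
import torch.nn as nn

class ConditionalMHSA(nn.Module):
    """Hypothetical question-gated multi-head self-attention (cMHSA-style sketch)."""

    def __init__(self, dim: int, num_heads: int, q_dim: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # joint query/key/value projection
        self.proj = nn.Linear(dim, dim)      # output projection
        # Per-head gates in [0, 1], predicted from the question embedding
        # (assumed gating scheme, not taken from the paper).
        self.gate = nn.Sequential(nn.Linear(q_dim, num_heads), nn.Sigmoid())

    def forward(self, x: torch.Tensor, q_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) visual tokens; q_emb: (B, q_dim) question embedding.
        B, N, dim = x.shape
        qkv = self.qkv(x).view(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)                        # each (B, N, H, d_h)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # each (B, H, N, d_h)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        out = attn.softmax(dim=-1) @ v                     # (B, H, N, d_h)
        g = self.gate(q_emb)                               # (B, H) question-conditioned gates
        out = out * g[:, :, None, None]                    # suppress/emphasize whole heads
        out = out.transpose(1, 2).reshape(B, N, dim)
        return x + self.proj(out)                          # residual connection

Under this reading, fixing every gate to 1 recovers a standard self-attention layer; applying the same question-conditioned gating to the grouped convolutions of a ResNeXt block would give a cResNeXt-style counterpart, and mixing the two kinds of gated blocks yields the McG family described above.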
Pages: 912-928
Number of Pages: 17