Multiscale Feature Extraction and Fusion of Image and Text in VQA

Cited: 188
Authors
Lu, Siyu [1 ]
Ding, Yueming [1 ]
Liu, Mingzhe [2 ]
Yin, Zhengtong [3 ]
Yin, Lirong [4 ]
Zheng, Wenfeng [1 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Automat, Chengdu 610054, Peoples R China
[2] Wenzhou Univ Technol, Sch Data Sci & Artificial Intelligence, Wenzhou 325000, Peoples R China
[3] Guizhou Univ, Coll Resource & Environm Engn, Guiyang 550025, Peoples R China
[4] Louisiana State Univ, Dept Geog & Anthropol, Baton Rouge, LA 70803 USA
Keywords
Multi-scale; Image features; Text information; Feature extraction and fusion; VQA;
DOI
10.1007/s44196-023-00233-6
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A Visual Question Answering (VQA) system must find the information in an image that is relevant to a question in order to answer that question correctly. VQA has broad applications in visual assistance, automated security surveillance, and intelligent human-robot interaction. However, VQA accuracy remains unsatisfactory, largely because image features do not adequately represent scene and object information and text features do not fully capture a question's meaning. This paper applies multi-scale feature extraction and fusion to both the image-representation and text-representation components of a VQA system to improve its accuracy. First, to address image representation, a multi-scale method extracts the feature maps output by different layers of a pre-trained deep neural network, and the best fusion scheme is identified experimentally. Second, for sentence representation, a multi-scale method characterizes and fuses word-level, phrase-level, and sentence-level features. Finally, the VQA model is improved with these multi-scale extraction and fusion methods. The results show that adding multi-scale feature extraction and fusion improves the accuracy of the VQA model.
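The two multi-scale branches described in the abstract can be illustrated with a minimal NumPy sketch: pool feature maps from several network stages into vectors and concatenate them (image side), and fuse word-, phrase-, and sentence-level poolings of word embeddings (text side). This is an assumption-laden illustration, not the authors' implementation; the stage shapes, pooling operators, n-gram phrase window, and 300-d embedding size are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# ---- Image branch: fuse features from three (simulated) CNN stages ----
def global_avg_pool(feat):
    """Pool a (C, H, W) feature map down to a C-dim vector."""
    return feat.mean(axis=(1, 2))

stage_outputs = [
    rng.standard_normal((64, 56, 56)),   # shallow stage: fine detail
    rng.standard_normal((128, 28, 28)),  # middle stage
    rng.standard_normal((256, 14, 14)),  # deep stage: semantics
]
image_vec = np.concatenate([global_avg_pool(f) for f in stage_outputs])

# ---- Text branch: word-, phrase-, and sentence-level features ----
def phrase_features(word_vecs, n=2):
    """Average embeddings over sliding n-gram windows (phrase level)."""
    return np.stack([word_vecs[i:i + n].mean(axis=0)
                     for i in range(len(word_vecs) - n + 1)])

word_vecs = rng.standard_normal((7, 300))         # 7 tokens, 300-d embeddings
word_level = word_vecs.max(axis=0)                # word-level max pooling
phrase_level = phrase_features(word_vecs).max(axis=0)
sentence_level = word_vecs.mean(axis=0)           # crude sentence summary
text_vec = np.concatenate([word_level, phrase_level, sentence_level])

print(image_vec.shape, text_vec.shape)  # (448,) (900,)
```

Concatenation is only one possible fusion scheme; the paper reports searching experimentally over fusion methods, so element-wise or bilinear fusion could be substituted at the `np.concatenate` steps.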
Pages: 11