Visual Question Answering via Combining Inferential Attention and Semantic Space Mapping

Cited: 14
Authors
Liu, Yun [1 ]
Zhang, Xiaoming [2 ,3 ]
Huang, Feiran [4 ]
Zhou, Zhibo [1 ]
Zhao, Zhonghua [5 ]
Li, Zhoujun [6 ]
Affiliations
[1] Beihang Univ, Beijing Key Lab Network Technol, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
[3] Beihang Univ, Hefei Innovat Res Inst, Hefei 230012, Peoples R China
[4] Jinan Univ, Coll Informat Sci & Technol, Coll Cyber Secur, Guangzhou 510632, Peoples R China
[5] Coordinat Ctr China, Natl Comp Network Emergency Response Tech Team, Beijing 100029, Peoples R China
[6] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China;
Keywords
Visual Question Answering; Inferential attention; Semantic space mapping; NETWORKS;
DOI
10.1016/j.knosys.2020.106339
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual Question Answering (VQA) has emerged and attracted widespread interest in recent years. Its purpose is to explore the close correlations between the image and the question for answer inference. We make two observations about the VQA task: (1) the number of newly defined answers is ever-growing, so predicting only over pre-defined labeled answers can lead to errors, since an unlabeled answer may be the right choice for a question-image pair; (2) when humans answer visual questions, the gradual shift of their attention plays an important guiding role in exploring the correlations between images and questions. Based on these observations, we propose a novel model for VQA that combines Inferential Attention and Semantic Space Mapping (IASSM). Specifically, our model has two salient aspects: (1) a semantic space shared by both labeled and unlabeled answers is constructed to learn new answers, where the joint embedding of a question and the corresponding image is mapped and clustered around the answer exemplar; (2) a novel inferential attention model is designed to simulate the learning process of human attention, exploring the correlations between the image and the question by focusing on the more important question words and the image regions associated with the question. Both the inferential attention and the semantic space mapping modules are integrated into an end-to-end framework to infer the answer. Experiments on two public VQA datasets and our newly constructed dataset show the superiority of IASSM over existing methods. (C) 2020 Elsevier B.V. All rights reserved.
Pages: 12
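
To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the two modules, written from the abstract alone and not from the authors' code: an attention module that weights image regions by their relevance to the question, and a projection of the fused question-image embedding into a semantic space shared with answer exemplars, where the nearest exemplar is read out as the answer. All class names, layer sizes, the additive attention form, and the cosine nearest-exemplar readout are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class InferentialAttention(nn.Module):
    """Weight image regions by relevance to the question (a simplified
    stand-in for the paper's inferential attention module)."""

    def __init__(self, q_dim: int, v_dim: int, hid_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hid_dim)
        self.v_proj = nn.Linear(v_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q: (B, q_dim) question embedding; v: (B, R, v_dim) region features
        joint = torch.tanh(self.q_proj(q).unsqueeze(1) + self.v_proj(v))
        alpha = F.softmax(self.score(joint), dim=1)   # (B, R, 1) region weights
        return (alpha * v).sum(dim=1)                 # (B, v_dim) attended feature


class IASSMSketch(nn.Module):
    """Map the fused question-image embedding into a semantic space shared
    with answer exemplars and pick the nearest exemplar as the answer."""

    def __init__(self, q_dim: int, v_dim: int, hid_dim: int, sem_dim: int):
        super().__init__()
        self.attend = InferentialAttention(q_dim, v_dim, hid_dim)
        self.to_semantic = nn.Linear(q_dim + v_dim, sem_dim)

    def forward(self, q, v, exemplars):
        # exemplars: (A, sem_dim) embeddings of labeled AND unlabeled answers,
        # e.g. word vectors of answer phrases (an assumption of this sketch)
        v_att = self.attend(q, v)
        z = F.normalize(self.to_semantic(torch.cat([q, v_att], dim=-1)), dim=-1)
        sims = z @ F.normalize(exemplars, dim=-1).t()  # (B, A) cosine similarities
        return sims.argmax(dim=-1)                     # index of nearest answer


if __name__ == "__main__":
    q = torch.randn(2, 300)           # e.g. GRU-encoded questions
    v = torch.randn(2, 36, 2048)      # e.g. 36 region features per image
    exemplars = torch.randn(1000, 256)
    model = IASSMSketch(q_dim=300, v_dim=2048, hid_dim=512, sem_dim=256)
    print(model(q, v, exemplars).shape)  # torch.Size([2])

Because the answer is recovered by nearest-neighbor search among exemplars rather than by a softmax over a fixed label set, exemplar embeddings for new, unlabeled answers can be added without retraining, which is what lets a shared semantic space address the ever-growing answer vocabulary noted in observation (1) of the abstract.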