Self-Adaptive Neural Module Transformer for Visual Question Answering

Cited by: 48
Authors
Zhong, Huasong [1 ]
Chen, Jingyuan [1 ]
Shen, Chen [1 ]
Zhang, Hanwang [2 ]
Huang, Jianqiang [1 ]
Hua, Xian-Sheng [1 ]
Affiliations
[1] Alibaba Grp, Dept DAMO Acad, Hangzhou 311121, Peoples R China
[2] Nanyang Technol Univ, Singapore 639798, Singapore
Keywords
Layout; Cognition; Task analysis; Visualization; Neural networks; Knowledge discovery; Decoding; Visual question answering; neural module transformer; multi-modal; self-adaptive
DOI
10.1109/TMM.2020.2995278
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Vision and language understanding is one of the most fundamental and difficult tasks in Multimedia Intelligence. Among these tasks, Visual Question Answering (VQA) is especially challenging because it requires complex reasoning steps to reach the correct answer. To achieve this, Neural Module Network (NMN) and its variants rely on parsing the natural-language question into a module layout (i.e., a problem-solving program). In particular, this process follows a feedforward encoder-decoder pipeline: the encoder embeds the question into a static vector and the decoder generates the layout. However, we argue that such a conventional encoder-decoder pipeline neglects both the dynamic nature of question comprehension (i.e., we should attend to different words from step to step) and the per-module intermediate results (i.e., we should discard modules that perform badly) during the reasoning steps. In this paper, we present a novel NMN, called Self-Adaptive Neural Module Transformer (SANMT), which adaptively adjusts both the question feature encoding and the layout decoding by considering intermediate Q&A results. Specifically, a novel transformer module encodes the intermediate results together with the given question features to generate a dynamic question embedding that evolves over the reasoning steps. In addition, the transformer uses the intermediate results from each reasoning step to guide the subsequent layout arrangement. Extensive experimental evaluations demonstrate the superiority of the proposed SANMT over NMN and its variants on four challenging benchmarks, including CLEVR, CLEVR-CoGenT, VQAv1.0, and VQAv2.0 (on average, the relative improvements in accuracy over NMN are 1.5, 2.3, 0.7, and 0.5 points, respectively).
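The self-adaptive loop described in the abstract (re-attend to the question words at each reasoning step, conditioned on the previous step's intermediate result) can be sketched roughly as below. This is an illustrative NumPy toy under stated assumptions, not the authors' implementation: the fusion-by-addition, the random stand-in module outputs, and the names `attend` and `sanmt_reasoning_step` are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    # scaled dot-product attention for a single query vector
    scores = keys @ query / np.sqrt(query.shape[-1])
    return softmax(scores) @ values

def sanmt_reasoning_step(word_feats, q_state, intermediate):
    """One self-adaptive step: fuse the previous intermediate Q&A
    result into the question state (fusion-by-addition is an
    assumption), then re-attend over the question words so the
    question embedding evolves across reasoning steps."""
    fused = q_state + intermediate
    return attend(fused, word_feats, word_feats)

rng = np.random.default_rng(0)
words = rng.standard_normal((6, 8))   # 6 word embeddings, dim 8
q = words.mean(axis=0)                # initial static question vector
inter = np.zeros(8)                   # no intermediate result yet
for t in range(3):                    # three reasoning steps
    q = sanmt_reasoning_step(words, q, inter)
    inter = rng.standard_normal(8) * 0.1  # stand-in module output
```

In the paper's full model, the intermediate result would come from the executed neural module at each step and would also steer the layout decoder; here it is only a random placeholder to show the control flow.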
Pages: 1264-1273
Page count: 10