Adversarial Multimodal Network for Movie Story Question Answering

Cited by: 15
Authors
Yuan, Zhaoquan [1 ]
Sun, Siyuan [2 ,3 ]
Duan, Lixin [2 ,3 ]
Li, Changsheng [4 ]
Wu, Xiao [1 ]
Xu, Changsheng [5 ]
Affiliations
[1] Southwest Jiaotong Univ, Sch Informat Sci & Technol, Chengdu 610031, Peoples R China
[2] Univ Elect Sci & Technol China, Big Data Res Ctr, Chengdu 610051, Peoples R China
[3] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 610051, Peoples R China
[4] Beijing Inst Technol, Sch Comp Sci & Technol, Beijing 100081, Peoples R China
[5] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Knowledge discovery; Motion pictures; Visualization; Task analysis; Generators; Natural languages; Movie question answering; adversarial network; multimodal understanding
DOI
10.1109/TMM.2020.3002667
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Visual question answering that draws on information from multiple modalities has attracted increasing attention in recent years, yet it remains challenging because visual content and natural language have very different statistical properties. In this work, we present a method called Adversarial Multimodal Network (AMN) to better understand video stories for question answering. In AMN, we propose to learn multimodal feature representations by finding a more coherent subspace for video clips and the corresponding texts (e.g., subtitles and questions) based on generative adversarial networks. Moreover, a self-attention mechanism is developed to enforce our newly introduced consistency constraint, which preserves the self-correlation among the visual cues of the original video clips in the learned multimodal representations. Extensive experiments on the benchmark MovieQA and TVQA datasets show the effectiveness of the proposed AMN over other published state-of-the-art methods.
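The abstract combines two ideas: an adversarial game that pulls video-clip and text embeddings into a shared subspace, and a self-attention consistency term that keeps the projected visual features correlated the way the original clip features were. Below is a minimal PyTorch sketch of that combination; it is not the authors' implementation, and the module names, dimensions, loss weights, and training loop are illustrative assumptions.

# Hedged sketch (not the authors' code): adversarial alignment of video and
# text features into a shared subspace, plus a self-attention consistency
# term on the visual side. All sizes and weights are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Projects modality-specific features into a shared d-dimensional space."""
    def __init__(self, in_dim, d=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, d), nn.ReLU(), nn.Linear(d, d))
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Predicts whether a shared-space feature came from video (1) or text (0)."""
    def __init__(self, d=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, d // 2), nn.ReLU(), nn.Linear(d // 2, 1))
    def forward(self, z):
        return self.net(z).squeeze(-1)

def self_correlation(feats):
    """Row-normalized self-attention (self-correlation) matrix of (num_clips, dim) features."""
    attn = feats @ feats.t() / feats.size(-1) ** 0.5
    return F.softmax(attn, dim=-1)

# Toy batch: 8 video-clip features (e.g. pooled CNN) and 8 subtitle features.
video_raw = torch.randn(8, 2048)
text_raw  = torch.randn(8, 300)

g_video, g_text = Generator(2048), Generator(300)
disc = Discriminator()
opt_g = torch.optim.Adam(list(g_video.parameters()) + list(g_text.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

for step in range(100):
    zv, zt = g_video(video_raw), g_text(text_raw)

    # Discriminator step: tell video embeddings apart from text embeddings.
    d_loss = F.binary_cross_entropy_with_logits(disc(zv.detach()), torch.ones(8)) + \
             F.binary_cross_entropy_with_logits(disc(zt.detach()), torch.zeros(8))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator so the two modalities become
    # indistinguishable in the shared subspace ...
    adv_loss = F.binary_cross_entropy_with_logits(disc(zv), torch.zeros(8)) + \
               F.binary_cross_entropy_with_logits(disc(zt), torch.ones(8))
    # ... while the consistency term keeps the self-correlation structure of the
    # original clips in the projected visual features.
    consistency = F.mse_loss(self_correlation(zv), self_correlation(video_raw))
    g_loss = adv_loss + 0.1 * consistency
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

In the paper these representations presumably feed an answer-selection module; the standalone loop above only illustrates how the adversarial and consistency losses interact.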
Pages: 1744-1756
Page count: 13
Related Papers
50 records in total
  • [1] Progressive Attention Memory Network for Movie Story Question Answering
    Kim, Junyeong
    Ma, Minuk
    Kim, Kyungsu
    Kim, Sungjin
    Yoo, Chang D.
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 8329 - 8338
  • [2] Multimodal Dual Attention Memory for Video Story Question Answering
    Kim, Kyung-Min
    Choi, Seong-Ho
    Kim, Jin-Hwa
    Zhang, Byoung-Tak
    COMPUTER VISION - ECCV 2018, PT 15, 2018, 11219 : 698 - 713
  • [3] From text to multimodal: a survey of adversarial example generation in question answering systems
    Yigit, Gulsum
    Amasyali, Mehmet Fatih
    KNOWLEDGE AND INFORMATION SYSTEMS, 2024, 66 (12) : 7165 - 7204
  • [4] Holistic Multi-Modal Memory Network for Movie Question Answering
    Wang, Anran
    Anh Tuan Luu
    Foo, Chuan-Sheng
    Zhu, Hongyuan
    Tay, Yi
    Chandrasekhar, Vijay
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 489 - 499
  • [5] A Universal Quaternion Hypergraph Network for Multimodal Video Question Answering
    Guo, Zhicheng
    Zhao, Jiaxuan
    Jiao, Licheng
    Liu, Xu
    Liu, Fang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 38 - 49
  • [6] Improving Visual Question Answering by Multimodal Gate Fusion Network
    Xiang, Shenxiang
    Chen, Qiaohong
    Fang, Xian
    Guo, Menghao
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [7] Multimodal Graph Transformer for Multimodal Question Answering
    He, Xuehai
    Wang, Xin Eric
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 189 - 200
  • [8] Multimodal Attention for Visual Question Answering
    Kodra, Lorena
    Mece, Elinda Kajo
    INTELLIGENT COMPUTING, VOL 1, 2019, 858 : 783 - 792