Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering

Cited by: 7

Authors:

Wang, Yan [1,2]

Li, Peize [3]

Si, Qingyi [4,5]

Zhang, Hanwen [4,5]

Zang, Wenyu [6]

Lin, Zheng [4,5]

Fu, Peng [4,5]

Affiliations:

[1] Jilin Univ, Coll Comp Sci & Technol, Sch Artificial Intelligence, Changchun 130012, Peoples R China

[2] Jilin Univ, Coll Comp Sci & Technol, Minist Educ, Key Lab Symbol Comp & Knowledge Engn, Changchun 130012, Peoples R China

[3] Jilin Univ, Sch Artificial Intelligence, Changchun 130012, Peoples R China

[4] Chinese Acad Sci, Inst Informat Engn, Beijing 100049, Peoples R China

[5] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing 100049, Peoples R China

[6] China Elect Corp, Beijing 100846, Peoples R China

Source:

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS | 2024 / Vol. 20 / No. 03

Funding:

National Natural Science Foundation of China;

Keywords:

Cross-modality relation; external knowledge; visual question answering;

DOI:

10.1145/3618301

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Knowledge-based visual question answering requires not only answering questions based on images but also incorporating external knowledge to reason in the joint space of vision and language. To bridge the gap between visual content and semantic cues, it is important to capture question-related, semantics-rich vision-language connections. Most existing solutions model simple intra-modality relations or represent the cross-modality relation using a single vector, which makes it difficult to model the complex connections between visual features and question features. Thus, we propose a cross-modality multiple relations learning model that aims to enrich cross-modality representations and construct advanced multi-modality knowledge triplets. First, we design a simple yet effective method to generate multiple relations that represent the rich cross-modality relations. These cross-modality relations link the textual question to the related visual objects, and the resulting multi-modality triplets efficiently align the visual objects with the corresponding textual answers. Second, to encourage the multiple relations to better align with different semantic relations, we further formulate a novel global-local loss. The global loss brings the visual objects and corresponding textual answers close to each other through cross-modality relations in the vision-language space, and the local loss better preserves semantic diversity among the multiple relations. Experimental results on the Outside Knowledge VQA and Knowledge-Routed Visual Question Reasoning datasets demonstrate that our model outperforms state-of-the-art methods.
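
The abstract describes the global-local loss only at a high level. As a rough illustration, and not the authors' actual formulation, the idea could be sketched as follows. The sketch assumes a translation-style triplet score (visual object + relation ≈ answer) for the global term and a cosine-similarity penalty among relation vectors for the local term. The function names, the margin value, and the in-batch negative sampling are all hypothetical choices made for this example.

```python
import numpy as np


def global_loss(heads, relations, tails, margin=1.0):
    """Hypothetical global term: a translation-style margin loss.

    heads:     (B, D) visual-object embeddings
    relations: (B, K, D) K cross-modality relation vectors per example
    tails:     (B, D) answer embeddings
    """
    # Distance from (head + relation) to the matching answer, per relation.
    pos = np.linalg.norm(heads[:, None, :] + relations - tails[:, None, :], axis=-1)
    # Assumed negatives: answers shifted by one position within the batch.
    neg_tails = np.roll(tails, 1, axis=0)
    neg = np.linalg.norm(heads[:, None, :] + relations - neg_tails[:, None, :], axis=-1)
    # Score each example by its best-matching relation.
    hinge = np.maximum(0.0, margin + pos.min(axis=1) - neg.min(axis=1))
    return float(np.mean(hinge))


def local_loss(relations):
    """Hypothetical local term: penalize similarity among the K relations.

    Returns the mean squared off-diagonal cosine similarity, which is near 0
    when the relations are mutually orthogonal and near 1 when identical.
    Requires K >= 2.
    """
    norms = np.linalg.norm(relations, axis=-1, keepdims=True)
    unit = relations / (norms + 1e-8)
    sim = unit @ unit.transpose(0, 2, 1)  # (B, K, K) cosine similarities
    k = relations.shape[1]
    off_diag = sim * (1.0 - np.eye(k))    # zero out self-similarity
    return float(np.sum(off_diag ** 2) / (relations.shape[0] * k * (k - 1)))
```

In such a setup the two terms would typically be combined as a weighted sum; the weighting is another assumption not stated in the abstract.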


Pages: 22
