Vision-Language-Knowledge Co-Embedding for Visual Commonsense Reasoning

Cited by: 5

Authors:

Lee, JaeYun [1]

Kim, Incheol [1]

Affiliations:

[1] Kyonggi Univ, Dept Comp Sci, Suwon 16227, South Korea

Source:

SENSORS | 2021 / Vol. 21 / Issue 09

Keywords:

visual commonsense reasoning; multimodal co-embedding; knowledge graph; graph convolutional network; pretrained multi-head self-attention network;

DOI:

10.3390/s21092911

Chinese Library Classification (CLC) number:

O65 [Analytical Chemistry];

Discipline classification codes:

070302 ; 081704 ;

Abstract:

Visual commonsense reasoning is the task of choosing the most appropriate answer to a question and providing the rationale for that answer, given an image, a natural language question, and candidate responses. Effective visual commonsense reasoning requires solving both the knowledge acquisition problem and the multimodal alignment problem. We therefore propose a novel Vision-Language-Knowledge Co-embedding (ViLaKC) model that extracts knowledge graphs relevant to the question from an external knowledge base, ConceptNet, and uses them together with the input image to answer the question. The proposed model uses a pretrained vision-language-knowledge embedding module, which co-embeds multimodal data, including images, natural language texts, and knowledge graphs, into a single feature vector. To reflect the structural information of the knowledge graph, the model first embeds the knowledge graph with a graph convolutional network layer and then uses multi-head self-attention layers to co-embed it with the image and the natural language question. The effectiveness and performance of the proposed model are experimentally validated on the VCR v1.0 benchmark dataset.
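
The two-stage embedding described in the abstract (graph convolution over the knowledge graph, then multi-head self-attention over image, text, and graph tokens) can be illustrated with a minimal pure-Python sketch. This is not the authors' ViLaKC implementation: the toy features, the weight matrix, the symmetric adjacency normalization, the attention without learned projections, and the mean pooling into a single vector are all assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).
    The symmetric normalization is an assumed, commonly used variant."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # >= 1 thanks to the self-loops
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_self_attention(tokens, num_heads):
    """Scaled dot-product self-attention per head, heads concatenated.
    Learned query/key/value projections are omitted (assumption, for brevity)."""
    d = len(tokens[0])
    assert d % num_heads == 0
    hd = d // num_heads
    heads = []
    for h in range(num_heads):
        sub = [t[h * hd:(h + 1) * hd] for t in tokens]
        scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(hd) for kj in sub]
                  for qi in sub]
        weights = [softmax(row) for row in scores]
        heads.append(matmul(weights, sub))
    # Concatenate the per-head outputs token by token.
    return [sum((head[i] for head in heads), []) for i in range(len(tokens))]

# Hypothetical toy inputs: 2 image regions, 3 question tokens, 3 knowledge-graph nodes.
image_regions = [[0.2, 0.1, 0.0, 0.5], [0.9, 0.3, 0.4, 0.1]]
question_tokens = [[0.1, 0.0, 0.3, 0.2], [0.5, 0.5, 0.1, 0.0], [0.0, 0.2, 0.2, 0.7]]
kg_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
kg_feats = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 0.0]]
kg_weight = [[0.1, 0.2, -0.1, 0.3], [0.0, 0.4, 0.2, -0.2], [0.3, -0.1, 0.1, 0.1]]

# Stage 1: embed the knowledge graph so node vectors reflect graph structure.
kg_nodes = gcn_layer(kg_adj, kg_feats, kg_weight)

# Stage 2: co-embed all three modalities in one token sequence.
sequence = image_regions + question_tokens + kg_nodes
fused = multi_head_self_attention(sequence, num_heads=2)

# Mean pooling into a single feature vector (an assumed pooling choice).
co_embedding = [sum(col) / len(fused) for col in zip(*fused)]
```

Embedding the graph first means that each knowledge node already carries information from its neighbors before the attention layers mix it with the image and question tokens.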

Pages: 19

12 items in total

[1] Learning to Agree on Vision Attention for Visual Commonsense Reasoning
Li, Zhenyang
Guo, Yangyang
Wang, Kejie
Liu, Fan
Nie, Liqiang
Kankanhalli, Mohan
IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1065 - 1075
[2] Multi-Level Knowledge Injecting for Visual Commonsense Reasoning
Wen, Zhang
Peng, Yuxin
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (03) : 1042 - 1054
[3] A Co-Embedding Model with Variational Auto-Encoder for Knowledge Graphs
Xie, Luodi
Huang, Huimin
Du, Qing
APPLIED SCIENCES-BASEL, 2022, 12 (02):
[4] How to Use Language Expert to Assist Inference for Visual Commonsense Reasoning
Song, Zijie
Hu, Wenbo
Ye, Hao
Hong, Richang
2023 23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS, ICDMW 2023, 2023, : 521 - 527
[5] Multi-Modal Structure-Embedding Graph Transformer for Visual Commonsense Reasoning
Zhu, Jian
Wang, Hanli
He, Bin
IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1295 - 1305
[6] Efficient and self-adaptive rationale knowledge base for visual commonsense reasoning
Song, Zijie
Hu, Zhenzhen
Hong, Richang
MULTIMEDIA SYSTEMS, 2023, 29 (05) : 3017 - 3026
[7] KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for visual commonsense reasoning
Song, Dandan
Ma, Siyi
Sun, Zhanchen
Yang, Sicheng
Liao, Lejian
KNOWLEDGE-BASED SYSTEMS, 2021, 230
[8] Efficient and self-adaptive rationale knowledge base for visual commonsense reasoning
Zijie Song
Zhenzhen Hu
Richang Hong
Multimedia Systems, 2023, 29 : 3017 - 3026
[9] Utilizing Language Models to Expand Vision-Based Commonsense Knowledge Graphs
Rezaei, Navid
Reformat, Marek Z.
SYMMETRY-BASEL, 2022, 14 (08):
[10] Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR
Li, Zhenyang
Guo, Yangyang
Wang, Kejie
Chen, Xiaolin
Nie, Liqiang
Kankanhalli, Mohan
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5634 - 5644