Efficient and self-adaptive rationale knowledge base for visual commonsense reasoning

Cited by: 0

Authors:

Zijie Song

Zhenzhen Hu

Richang Hong

Affiliation:

[1] Hefei University of Technology

Source:

Multimedia Systems | 2023 / Vol. 29

Keywords:

Visual commonsense reasoning; Knowledge base; Convolution operation; Multi-modal fusion

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

The visual commonsense reasoning (VCR) task requires cognition-level understanding across the visual and linguistic domains. Its three sub-tasks, Q→A, QA→R, and Q→AR, require predicting the correct answer and the rationale that explains it, given an image and a question. Unlike other visual reasoning tasks such as VQA and GQA, VCR focuses on exploring the facts that clarify the causes, context, and consequences of the image and question, which is a process of acquiring knowledge and reaching a thorough understanding. In this paper, we propose a rationale knowledge base (RKB) that incorporates a convolution fusion mechanism to import VCR-related knowledge. We emphasize that (1) the RKB is extracted from and then trained on the VCR dataset (VCR-set) itself, and (2) the convolution fusion mechanism is designed to be self-adaptive and computationally efficient. Experiments on the large-scale VCR-set demonstrate the effectiveness of the proposed method on all three sub-tasks.

Citation

Pages: 3017-3026

Page count: 9
