Efficient and self-adaptive rationale knowledge base for visual commonsense reasoning

Cited by: 0
Authors
Zijie Song
Zhenzhen Hu
Richang Hong
Affiliations
[1] Hefei University of Technology
Source
Multimedia Systems | 2023 / Vol. 29
Keywords
Visual commonsense reasoning; Knowledge base; Convolution operation; Multi-modal fusion
DOI
Not available
Abstract
The visual commonsense reasoning (VCR) task calls for a cognitive level of understanding across the visual and linguistic domains. Its three sub-tasks, Q→A, QA→R, and Q→AR, require predicting the correct answer and a rationale that explains it, given an image and a question. Unlike other visual reasoning tasks such as VQA and GQA, VCR focuses on exploring the facts that clarify the causes, context, and consequences of the image and question, which is a process of acquiring knowledge and thorough understanding. In this paper, we propose a rationale knowledge base (RKB) that incorporates a convolution fusion mechanism to import VCR-related knowledge. We emphasize that (1) the RKB is extracted from, and then trained on, the VCR dataset (VCR-set) itself, and (2) the convolution fusion mechanism is designed to be self-adaptive and computationally efficient. Experiments on the large-scale VCR-set demonstrate the effectiveness of the proposed method on all three sub-tasks.
Pages: 3017-3026
Page count: 9