SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract)

Cited by: 0
Authors
Naik, Atharva [1 ]
Butala, Yash Parag [1 ]
Vaikunthan, Navaneethan [1 ]
Kapoor, Raghav [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Source
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21 | 2024
Keywords
(none listed)
DOI
Not available
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
When humans are posed with a difficult problem, they often approach it by identifying key skills, honing them, and finally combining them effectively. We propose a novel method and apply it to the VizWiz VQA task: we predict the visual skills needed to answer a question, then leverage expert modules to produce intermediary outputs and fuse them in a skill-aware manner. Unlike prior work in visual question answering (VQA) that uses intermediate outputs such as detected objects and Optical Character Recognition (OCR), our approach explicitly guides the model with a skill embedding on what to focus on. While our results show that skill-aware fusion outperforms skill-unaware models for only a subset of questions, we believe they suggest interesting directions for future work. We also release our code, model, and illustrative demonstrations for future research purposes.
Pages: 23592–23593
Page count: 2