Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI

Cited by: 0
Authors
Lee, Gyeonggeon [1 ,2 ]
Zhai, Xiaoming [2 ,3 ,4 ]
Affiliations
[1] Natl Inst Educ, Nat Sci & Sci Educ Dept, Nat Sci & Sci Educ, 1 Nanyang Walk, Singapore 637616, Singapore
[2] Univ Georgia, AI4STEM Educ Ctr, 110 Carlton St, Athens, GA 30602 USA
[3] Univ Georgia, Natl GENIUS Ctr, 110 Carlton St, Athens, GA 30602 USA
[4] Univ Georgia, Dept Math Sci & Social Studies Educ, 110 Carlton St, Athens, GA 30602 USA
Funding
U.S. National Science Foundation
Keywords
Artificial intelligence (AI); GPT-4V(ision); Visual question answering; Vision language model; Multimodality;
DOI
10.1007/s11528-024-01035-z
Chinese Library Classification
G40 [Education]
Discipline Classification Codes
040101; 120403
Abstract
Educators and researchers have analyzed various image data acquired from teaching and learning, such as images of learning materials, classroom dynamics, students' drawings, etc. However, this approach is labour-intensive and time-consuming, limiting its scalability and efficiency. Recent developments in Visual Question Answering (VQA) have streamlined this process by allowing users to pose questions about images and receive accurate, automatic answers, both in natural language, thereby enhancing efficiency and reducing the time required for analysis. State-of-the-art Vision Language Models (VLMs) such as GPT-4V(ision) have extended the applications of VQA to a wide range of educational purposes. This report employs GPT-4V as an example to demonstrate the potential of VLMs in enabling and advancing VQA for education. Specifically, we demonstrate that GPT-4V enables VQA for educational scholars without requiring technical expertise, thereby reducing accessibility barriers for general users. In addition, we contend that GPT-4V spotlights the transformative potential of VQA for educational research, representing a milestone accomplishment for visual data analysis in education.
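To make the VQA workflow described in the abstract concrete, the sketch below assembles a request that pairs a natural-language question with an image, the core pattern a VLM such as GPT-4V consumes. This is a minimal illustration, not code from the paper: the model name, image URL, and question are hypothetical assumptions, and no API call is actually made.

```python
# Minimal sketch of a VQA query payload for a GPT-4V-style chat API.
# The model name, image URL, and question below are illustrative
# assumptions, not details taken from the paper.

def build_vqa_request(question: str, image_url: str,
                      model: str = "gpt-4-vision-preview") -> dict:
    """Assemble a chat-completion request that pairs a natural-language
    question with an image -- the core Visual Question Answering pattern."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

# Example: asking about a student's drawing (hypothetical URL).
request = build_vqa_request(
    "What scientific concept does this student's drawing depict?",
    "https://example.com/student_drawing.png",
)
# With an SDK that accepts this schema (e.g. the OpenAI Python SDK), the
# payload would be sent via client.chat.completions.create(**request);
# here we only construct it.
```

The same question-plus-image exchange is what educators perform through the chat interface without writing any code, which is the accessibility point the report makes.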
Pages: 271-287 (17 pages)