Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI

Cited by: 0

Authors:

Lee, Gyeonggeon [1,2]

Zhai, Xiaoming [2,3,4]

Affiliations:

[1] Natl Inst Educ, Nat Sci & Sci Educ Dept, Nat Sci & Sci Educ, 1 Nanyang Walk, Singapore 637616, Singapore

[2] Univ Georgia, AI4STEM Educ Ctr, 110 Carlton St, Athens, GA 30602 USA

[3] Univ Georgia, Natl GENIUS Ctr, 110 Carlton St, Athens, GA 30602 USA

[4] Univ Georgia, Dept Math Sci & Social Studies Educ, 110 Carlton St, Athens, GA 30602 USA

Source:

TECHTRENDS | 2025

Funding:

U.S. National Science Foundation;

Keywords:

Artificial intelligence (AI); GPT-4V(ision); Visual question answering; Vision language model; Multimodality;

DOI:

10.1007/s11528-024-01035-z

Chinese Library Classification (CLC) number:

G40 [Education];

Discipline classification codes:

040101 ; 120403 ;

Abstract:

Educators and researchers have analyzed various image data acquired from teaching and learning, such as images of learning materials, classroom dynamics, and students' drawings. However, this approach is labour-intensive and time-consuming, limiting its scalability and efficiency. Recent developments in Visual Question Answering (VQA) have streamlined this process by allowing users to pose questions about images and receive accurate, automatic answers, both in natural language, thereby enhancing efficiency and reducing the time required for analysis. State-of-the-art Vision Language Models (VLMs) such as GPT-4V(ision) have extended the applications of VQA to a wide range of educational purposes. This report employs GPT-4V as an example to demonstrate the potential of VLMs in enabling and advancing VQA for education. Specifically, we demonstrate that GPT-4V enables VQA for educational scholars without requiring technical expertise, thereby reducing accessibility barriers for general users. In addition, we contend that GPT-4V spotlights the transformative potential of VQA for educational research, representing a milestone accomplishment for visual data analysis in education.

Pages: 271-287

Page count: 17
