Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI

Cited by: 0
Authors
Lee, Gyeonggeon [1 ,2 ]
Zhai, Xiaoming [2 ,3 ,4 ]
Affiliations
[1] Natl Inst Educ, Nat Sci & Sci Educ Dept, Nat Sci & Sci Educ, 1 Nanyang Walk, Singapore 637616, Singapore
[2] Univ Georgia, AI4STEM Educ Ctr, 110 Carlton St, Athens, GA 30602 USA
[3] Univ Georgia, Natl GENIUS Ctr, 110 Carlton St, Athens, GA 30602 USA
[4] Univ Georgia, Dept Math Sci & Social Studies Educ, 110 Carlton St, Athens, GA 30602 USA
Funding
U.S. National Science Foundation;
Keywords
Artificial intelligence (AI); GPT-4V(ision); Visual question answering; Vision language model; Multimodality;
DOI
10.1007/s11528-024-01035-z
Chinese Library Classification (CLC) number
G40 [Education];
Discipline classification codes
040101; 120403;
Abstract
Educators and researchers have analyzed various image data acquired from teaching and learning, such as images of learning materials, classroom dynamics, and students' drawings. However, this approach is labour-intensive and time-consuming, limiting its scalability and efficiency. Recent developments in Visual Question Answering (VQA) have streamlined this process by allowing users to pose questions about images and receive accurate, automatic answers, both in natural language, thereby enhancing efficiency and reducing the time required for analysis. State-of-the-art Vision Language Models (VLMs) such as GPT-4V(ision) have extended the applications of VQA to a wide range of educational purposes. This report employs GPT-4V as an example to demonstrate the potential of VLMs in enabling and advancing VQA for education. Specifically, we demonstrate that GPT-4V enables VQA for educational scholars without requiring technical expertise, thereby reducing accessibility barriers for general users. In addition, we contend that GPT-4V spotlights the transformative potential of VQA for educational research, representing a milestone for visual data analysis in education.
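The abstract describes posing a natural-language question about an educational image and receiving a natural-language answer. As an illustration only (not code from the paper), the following minimal Python sketch shows one way such a VQA query could be issued through the OpenAI Chat Completions API; the model name, image file, and question below are placeholder assumptions.

    # Minimal VQA sketch, assuming the OpenAI Python SDK (>= 1.x) and a
    # vision-capable model; file name and question are illustrative only.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Encode a local image (e.g., a student's drawing) as a base64 data URL.
    with open("student_drawing.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any vision-capable chat model
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "What scientific concept is the student illustrating, "
                             "and are any misconceptions visible?"},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
    )

    # The model's natural-language answer to the visual question.
    print(response.choices[0].message.content)

In this sketch the image and the question travel in a single user message, which is the pattern the abstract alludes to: no task-specific model training or computer-vision expertise is required of the user.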
Pages: 271-287
Number of pages: 17