AutoPaperBench: An MLLM-Based Framework for Automatic Generation of Paper Understanding Evaluation Benchmarks

Citations: 0
Authors
Kim, Min-Woo [1]
Park, Hyo-Bin [1]
Ahn, Hee-Jin [1]
Park, Woo-Ram [1]
Jeon, Jae-Wan [1]
Lee, Kyong-Ha [2]
Lee, Ryong [2]
Choi, Dong-Geol [1]
Affiliations
[1] Hanbat Natl Univ, Dept Informat & Commun Engn, Daejeon 34158, South Korea
[2] Korea Inst Sci & Technol Informat, Dept Large Scale AI Res Grp, Daejeon 34141, South Korea
Source
ELECTRONICS | 2025 / Vol. 14 / Issue 06
Keywords
large language model; deep learning; benchmark; research paper evaluation system
DOI
10.3390/electronics14061175
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
AutoPaperBench is a benchmark generation system for automatically evaluating how well Multimodal Large Language Models (MLLMs) comprehend research papers. The proposed system efficiently structures the content of a paper through semantic parsing and automatically generates text-based question-answer pairs (QAs) and visual question-answer pairs (VQAs). To ensure the quality of the generated QAs, we introduce a reviewer system that evaluates them on six criteria, including logic and appropriateness. In our experiments on 60 research papers from the medical, natural science, and engineering fields, the generated benchmarks yield model performance rankings comparable to those of previous benchmarks, and the performance improvements achieved through semantic parsing are validated. The system runs on a single GPU and provides a framework for efficiently evaluating LLM paper comprehension.
Pages: 20