LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models

Cited by: 0
Authors
Kahng, Minsuk [1]
Tenney, Ian [2]
Pushkarna, Mahima [3]
Liu, Michael Xieyang [4]
Wexler, James [3]
Reif, Emily [2]
Kallarackal, Krystal [3]
Chang, Minsuk [2]
Terry, Michael [3]
Dixon, Lucas [5]
Affiliations
[1] Google Res, Atlanta, GA 30322 USA
[2] Google Res, Seattle, WA USA
[3] Google Res, Cambridge, MA USA
[4] Google Res, Pittsburgh, PA USA
[5] Google Res, Paris, France
Source
EXTENDED ABSTRACTS OF THE 2024 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, CHI 2024 | 2024
Keywords
Visual analytics; generative AI; large language models; machine learning evaluation; side-by-side evaluation
DOI
10.1145/3613905.3650755
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Automatic side-by-side evaluation has emerged as a promising approach to evaluating the quality of responses from large language models (LLMs). However, analyzing the results from this evaluation approach raises scalability and interpretability challenges. In this paper, we present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows for users to understand when and why a model performs better or worse than a baseline model, and how the responses from two models are qualitatively different. We iteratively designed and developed the tool by closely working with researchers and engineers at Google. This paper details the user challenges we identified, the design and development of the tool, and an observational study with participants who regularly evaluate their models.
Pages: 7