LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models

Cited by: 0

Authors:

Kahng, Minsuk [1]

Tenney, Ian [2]

Pushkarna, Mahima [3]

Liu, Michael Xieyang [4]

Wexler, James [3]

Reif, Emily [2]

Kallarackal, Krystal [3]

Chang, Minsuk [2]

Terry, Michael [3]

Dixon, Lucas [5]

Affiliations:

[1] Google Res, Atlanta, GA 30322 USA

[2] Google Res, Seattle, WA USA

[3] Google Res, Cambridge, MA USA

[4] Google Res, Pittsburgh, PA USA

[5] Google Res, Paris, France

Source:

EXTENDED ABSTRACTS OF THE 2024 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, CHI 2024 | 2024

Keywords:

Visual analytics; generative AI; large language models; machine learning evaluation; side-by-side evaluation

DOI:

10.1145/3613905.3650755

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Automatic side-by-side evaluation has emerged as a promising approach to evaluating the quality of responses from large language models (LLMs). However, analyzing the results from this evaluation approach raises scalability and interpretability challenges. In this paper, we present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows for users to understand when and why a model performs better or worse than a baseline model, and how the responses from two models are qualitatively different. We iteratively designed and developed the tool by closely working with researchers and engineers at Google. This paper details the user challenges we identified, the design and development of the tool, and an observational study with participants who regularly evaluate their models.


Pages: 7
