The Importance of Workload Choice in Evaluating LLM Inference Systems

Cited by: 1
Authors
Papaioannou, Konstantinos [1 ]
Doudali, Thaleia Dimitra [1 ]
Affiliations
[1] Univ Politecn Madrid, IMDEA Software Inst, Madrid, Spain
Source
PROCEEDINGS OF THE 2024 4TH WORKSHOP ON MACHINE LEARNING AND SYSTEMS, EUROMLSYS 2024 | 2024
Keywords
Large Language Models; Inference; Machine Learning; KV Cache;
DOI
10.1145/3642970.3655823
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The success of Large Language Models (LLMs) across a wide range of applications and use cases has created the need for faster and more scalable systems for LLM inference. These systems speed up LLM inference by optimizing scheduling decisions or efficiently managing the available memory. However, most of them use synthetic datasets and target latency-critical scenarios in their evaluation, thereby overlooking a considerable part of real-world use cases and workloads. In response, this paper presents an extensive experimental evaluation that aims to capture the impact of the workload used for evaluation and to quantify the benefit of higher memory availability. Our analysis shows that LLMs can achieve 3x higher throughput for text generation and question-answering use cases compared to text summarization and conversational ones. The latter exhibit lower performance due to their demanding input sizes. In addition, non-latency-critical inference services achieve 2.3x higher throughput when 4x more memory is available. In conclusion, this paper aims to highlight the importance and impact of the chosen workloads in the evaluation of systems for LLM inference.
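The link the abstract draws between demanding input sizes and lower throughput follows from how the KV cache scales: its memory footprint grows linearly with sequence length, so long-input workloads (summarization, conversation) fit fewer concurrent requests into a fixed memory budget than short-input ones (question answering). The sketch below is an illustration of that arithmetic, not code from the paper; the model dimensions and the 40 GiB budget are hypothetical, chosen to resemble a typical ~7B-parameter model served in fp16.

```python
# Hypothetical sketch (not from the paper): KV cache size per request as a
# function of sequence length, assuming illustrative model dimensions.

def kv_cache_bytes(seq_len, num_layers=32, num_heads=32, head_dim=128,
                   bytes_per_elem=2):
    # Two tensors (key and value) per layer, each of logical shape
    # [num_heads, seq_len, head_dim], stored in fp16 (2 bytes per element).
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_elem

# Short sequence (e.g. question answering) vs. long one (e.g. summarization):
short = kv_cache_bytes(128)    # 64 MiB per request
long_ = kv_cache_bytes(2048)   # 1 GiB per request

budget = 40 * 1024**3  # hypothetical 40 GiB of free GPU memory for KV cache
print(budget // short, "concurrent short requests")   # 640
print(budget // long_, "concurrent long requests")    # 40
```

Because batch size bounds throughput in non-latency-critical serving, the 16x gap in per-request cache size translates directly into a 16x gap in achievable concurrency under this simplified model, which is consistent with the paper's observation that memory availability and workload input size dominate throughput.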
Pages: 39-46
Page count: 8