Assessing Inference Time in Large Language Models

Cited by: 1
Authors
Walkowiak, Bartosz [1 ]
Walkowiak, Tomasz [1 ]
Affiliations
[1] Wroclaw Univ Sci & Technol, Wroclaw, Poland
Source
SYSTEM DEPENDABILITY-THEORY AND APPLICATIONS, DEPCOS-RELCOMEX 2024 | 2024, Vol. 1026
Keywords
Large language models; model deployment; continuous batching
DOI
10.1007/978-3-031-61857-4_29
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Large Language Models have transformed the field of artificial intelligence, yet they are often associated with elitism and inaccessibility. This is primarily due to their large number of parameters, ranging from 1 billion to 70 billion, which makes inference on these models costly and resource-intensive. To tackle this challenge, various solutions have emerged with the goal of enabling efficient, fast, and resource-constrained inference. This study reviews and compares the available solutions. The authors conducted a series of experiments comparing the inference speed of the basic HuggingFace transformers library, the HuggingFace Text Generation Inference server, and the open-source vLLM library. The findings reveal that vLLM outperforms the other approaches examined. Additionally, the results highlight how relatively straightforward techniques, such as continuous batching, can significantly accelerate inference for large batch sizes.
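The continuous-batching technique highlighted in the abstract can be illustrated with a toy, self-contained simulation. This is not vLLM's actual implementation or API; the function names, the per-step cost model (one decode step per batch iteration), and the example request lengths are all illustrative assumptions. The sketch contrasts static batching, where a new batch starts only after every sequence in the current one finishes, with continuous batching, where a waiting request takes over a freed slot at the very next decode step:

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    # Static batching: each batch runs until its longest sequence is
    # done, so its cost is the max remaining length in the batch.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    # Continuous batching: finished sequences free their slots
    # immediately, and pending requests join at the next decode step.
    pending = deque(lengths)
    slots = []  # remaining tokens for each in-flight sequence
    steps = 0
    while pending or slots:
        # Refill any free slots from the pending queue.
        while pending and len(slots) < batch_size:
            slots.append(pending.popleft())
        steps += 1  # one decode step advances every in-flight sequence
        slots = [r - 1 for r in slots if r > 1]  # drop finished sequences
    return steps

# A mix of long (8-token) and short (2-token) requests.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(lengths, 4))      # 16: two batches, each dominated by an 8-token request
print(continuous_batching_steps(lengths, 4))  # 10: short requests stop stalling the batch
```

With this workload, continuous batching needs 10 decode steps instead of 16, because short sequences no longer force the whole batch to wait for the longest one, which is consistent with the paper's observation that the technique pays off most at large batch sizes with heterogeneous output lengths.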
Pages: 296-305
Number of pages: 10