From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference

Cited by: 38
Authors
Samsi, Siddharth [1 ]
Zhao, Dan [2 ]
McDonald, Joseph [1 ]
Li, Baolin [3 ]
Michaleas, Adam [1 ]
Jones, Michael [1 ]
Bergeron, William [1 ]
Kepner, Jeremy [1 ]
Tiwari, Devesh [3 ]
Gadepally, Vijay [1 ]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
[2] NYU, New York, NY USA
[3] Northeastern Univ, Boston, MA USA
Source
2023 IEEE High Performance Extreme Computing Conference (HPEC), 2023
Keywords
Large Language Models; Natural Language Processing; Inference; Green AI; LLM; NLP; Deep Learning; Distributed Computing; Energy; Sustainability
DOI
10.1109/HPEC58863.2023.10363447
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Large language models (LLMs) have exploded in popularity due to new generative capabilities that go far beyond the prior state of the art. These technologies are increasingly being leveraged in domains such as law, finance, and medicine. However, these models pose significant computational challenges, especially the compute and energy costs required for inference. Inference energy costs already receive less attention than the energy costs of training LLMs, despite how often these large models are called on to conduct inference in practice (e.g., ChatGPT). As these state-of-the-art LLMs see increasing usage and deployment in various domains, a better understanding of their resource utilization is crucial for cost savings, scaling performance, efficient hardware usage, and optimal inference strategies. In this paper, we describe experiments conducted to study the computational and energy utilization of inference with LLMs. We benchmark and conduct a preliminary analysis of the inference performance and energy costs of different sizes of LLaMA, a recent state-of-the-art LLM developed by Meta AI, on two generations of popular GPUs (NVIDIA V100 and A100) and two datasets (Alpaca and GSM8K), chosen to reflect the diverse set of tasks and benchmarks for LLMs in research and practice. We present the results of multi-node, multi-GPU inference using model sharding across up to 32 GPUs. To our knowledge, our work is one of the first to study LLM inference performance from the perspective of computational and energy resources at this scale.
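The abstract does not specify the measurement tooling; a common way to instrument inference energy on NVIDIA GPUs (including the V100 and A100 generations studied here) is to bracket the inference call with NVML's cumulative energy counter. Below is a minimal sketch assuming the pynvml bindings; the helper name and the run_inference callable are illustrative, not from the paper:

```python
# Minimal sketch (not the authors' tooling): estimate per-GPU energy for an
# inference run by differencing NVML's cumulative energy counter, which
# reports millijoules and is supported on Volta-class GPUs and newer
# (e.g., V100, A100).
import pynvml

def energy_of(run_inference, num_gpus=1):
    """Run `run_inference` and return (its result, joules consumed per GPU)."""
    pynvml.nvmlInit()
    try:
        handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(num_gpus)]
        before = [pynvml.nvmlDeviceGetTotalEnergyConsumption(h) for h in handles]
        result = run_inference()  # e.g., generate completions for a prompt batch
        after = [pynvml.nvmlDeviceGetTotalEnergyConsumption(h) for h in handles]
        return result, [(e - s) / 1000.0 for s, e in zip(before, after)]
    finally:
        pynvml.nvmlShutdown()

# Hypothetical usage: _, joules = energy_of(lambda: model.generate(batch))
```

For multi-node runs such as the paper's 32-GPU sharded configurations, each node would measure its local GPUs this way and the per-GPU readings would be summed across nodes.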
Pages: 9