DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

Cited by: 2
Authors
Maurya, Avinash [1 ]
Underwood, Robert [2 ]
Rafique, M. Mustafa [1 ]
Cappello, Franck [2 ]
Nicolae, Bogdan [2 ]
Affiliations
[1] Rochester Inst Technol, Rochester, NY 14623 USA
[2] Argonne Natl Lab, Lemont, IL USA
Source
PROCEEDINGS OF THE 33RD INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING, HPDC 2024 | 2024
Funding
U.S. National Science Foundation;
Keywords
LLMs and transformers; scalable checkpointing; asynchronous multilevel checkpointing;
DOI
10.1145/3625549.3658685
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
LLMs have seen rapid adoption across all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g., component failures, software instability, undesirable learning patterns) are frequent and typically affect training negatively. Thus, LLMs need to be checkpointed frequently so that they can be rolled back to a stable state and subsequently fine-tuned. However, given the large sizes of LLMs, a straightforward checkpointing solution that directly writes the model parameters and optimizer state to persistent storage (e.g., a parallel file system) incurs significant I/O overheads. To address this challenge, in this paper we study how to reduce the I/O overheads to enable fast and scalable checkpointing for LLMs, applicable at high frequency (up to the granularity of individual iterations) without significant impact on the training process. Specifically, we introduce a lazy asynchronous multi-level approach that takes advantage of the fact that the tensors making up the model and optimizer state shards remain immutable for extended periods of time, which makes it possible to copy their content in the background with minimal interference during training. We evaluate our approach at scales of up to 180 GPUs using different model sizes, parallelism settings, and checkpointing frequencies. The results show up to 48x faster checkpointing and 2.2x faster end-to-end training runtime compared with state-of-the-art checkpointing approaches.
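For illustration only (not part of the indexed record and not the authors' code), the following minimal PyTorch sketch approximates the idea described in the abstract: because model and optimizer shard tensors stay immutable between updates, they can be drained to pinned host buffers on a side CUDA stream and flushed to storage by a background thread, overlapping checkpoint I/O with training. The class name LazyAsyncCheckpointer, the checkpoint/wait_for_snapshot methods, and the single-file torch.save target are assumptions made for this sketch; DataStates-LLM itself uses a more elaborate multi-level, per-shard scheme.

import threading
import torch

class LazyAsyncCheckpointer:
    """Hypothetical sketch: overlap GPU-to-host copies and file writes with training."""

    def __init__(self):
        self.copy_stream = torch.cuda.Stream()  # side stream for device-to-host copies
        self.host_buffers = {}                  # reusable pinned host buffers, keyed by tensor name
        self.flush_thread = None                # background writer for the previous snapshot

    def checkpoint(self, named_tensors, path):
        # Pinned buffers can only be reused once the previous flush has finished.
        if self.flush_thread is not None:
            self.flush_thread.join()

        # Stage 1: enqueue asynchronous device-to-host copies on the side stream,
        # ordered after all training work submitted so far, to snapshot a consistent state.
        self.copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.copy_stream):
            for name, tensor in named_tensors.items():
                buf = self.host_buffers.get(name)
                if buf is None or buf.shape != tensor.shape:
                    buf = torch.empty(tensor.shape, dtype=tensor.dtype, pin_memory=True)
                    self.host_buffers[name] = buf
                buf.copy_(tensor, non_blocking=True)

        # Stage 2: flush the host copies to persistent storage in the background.
        def _flush():
            self.copy_stream.synchronize()       # host buffers are fully populated after this
            torch.save(self.host_buffers, path)  # single-file example; real systems write per-rank shards

        self.flush_thread = threading.Thread(target=_flush, daemon=True)
        self.flush_thread.start()

    def wait_for_snapshot(self):
        # Call before the next optimizer step: the training stream waits for the
        # copy stream, so tensors stay effectively immutable until they are snapshotted.
        torch.cuda.current_stream().wait_stream(self.copy_stream)

In a training loop, checkpoint(...) would be called right after an iteration and wait_for_snapshot() right before the next optimizer step, so parameter updates never race with the in-flight copies while the actual file writes proceed entirely in the background.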
Pages: 13