DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

Cited by: 2
Authors
Maurya, Avinash [1 ]
Underwood, Robert [2 ]
Rafique, M. Mustafa [1 ]
Cappello, Franck [2 ]
Nicolae, Bogdan [2 ]
Affiliations
[1] Rochester Inst Technol, Rochester, NY 14623 USA
[2] Argonne Natl Lab, Lemont, IL USA
Source
PROCEEDINGS OF THE 33RD INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING, HPDC 2024 | 2024
Funding
U.S. National Science Foundation;
Keywords
LLMs and transformers; scalable checkpointing; asynchronous multilevel checkpointing;
DOI
10.1145/3625549.3658685
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
LLMs have seen rapid adoption across all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g., component failures, software instability, undesirable learning patterns) are frequent and typically affect training negatively. Thus, LLMs need to be checkpointed frequently so that they can be rolled back to a stable state and subsequently fine-tuned. However, given the large sizes of LLMs, a straightforward checkpointing solution that directly writes the model parameters and optimizer state to persistent storage (e.g., a parallel file system) incurs significant I/O overheads. To address this challenge, in this paper we study how to reduce the I/O overheads to enable fast and scalable checkpointing for LLMs, applicable at high frequency (up to the granularity of individual iterations) without significant impact on the training process. Specifically, we introduce a lazy asynchronous multi-level approach that takes advantage of the fact that the tensors making up the model and optimizer state shards remain immutable for extended periods of time, which makes it possible to copy their content in the background with minimal interference during training. We evaluate our approach at scales of up to 180 GPUs using different model sizes, parallelism settings, and checkpointing frequencies. The results show up to 48x faster checkpointing and 2.2x faster end-to-end training runtime compared with state-of-the-art checkpointing approaches.
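For illustration only (not part of the indexed record and not the authors' code), the following minimal PyTorch sketch approximates the idea described in the abstract: because model and optimizer shard tensors stay immutable between updates, they can be drained to pinned host buffers on a side CUDA stream and flushed to storage by a background thread, overlapping checkpoint I/O with training. The class name LazyAsyncCheckpointer, the checkpoint/wait_for_snapshot methods, and the single-file torch.save target are assumptions made for this sketch; DataStates-LLM itself uses a more elaborate multi-level, per-shard scheme.

import threading
import torch

class LazyAsyncCheckpointer:
    """Hypothetical sketch: overlap GPU-to-host copies and file writes with training."""

    def __init__(self):
        self.copy_stream = torch.cuda.Stream()  # side stream for device-to-host copies
        self.host_buffers = {}                  # reusable pinned host buffers, keyed by tensor name
        self.flush_thread = None                # background writer for the previous snapshot

    def checkpoint(self, named_tensors, path):
        # Pinned buffers can only be reused once the previous flush has finished.
        if self.flush_thread is not None:
            self.flush_thread.join()

        # Stage 1: enqueue asynchronous device-to-host copies on the side stream,
        # ordered after all training work submitted so far, to snapshot a consistent state.
        self.copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.copy_stream):
            for name, tensor in named_tensors.items():
                buf = self.host_buffers.get(name)
                if buf is None or buf.shape != tensor.shape:
                    buf = torch.empty(tensor.shape, dtype=tensor.dtype, pin_memory=True)
                    self.host_buffers[name] = buf
                buf.copy_(tensor, non_blocking=True)

        # Stage 2: flush the host copies to persistent storage in the background.
        def _flush():
            self.copy_stream.synchronize()       # host buffers are fully populated after this
            torch.save(self.host_buffers, path)  # single-file example; real systems write per-rank shards

        self.flush_thread = threading.Thread(target=_flush, daemon=True)
        self.flush_thread.start()

    def wait_for_snapshot(self):
        # Call before the next optimizer step: the training stream waits for the
        # copy stream, so tensors stay effectively immutable until they are snapshotted.
        torch.cuda.current_stream().wait_stream(self.copy_stream)

In a training loop, checkpoint(...) would be called right after an iteration and wait_for_snapshot() right before the next optimizer step, so parameter updates never race with the in-flight copies while the actual file writes proceed entirely in the background.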
Pages: 13