GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching

Cited by: 1
Authors
Guo, Cong [1]
Zhang, Rui [2]
Xu, Jiale [1]
Leng, Jingwen [1]
Liu, Zihan [1]
Huang, Ziyu [1]
Guo, Minyi [1]
Wu, Hao [2]
Zhao, Shouren [2]
Zhao, Junping [2]
Zhang, Ke [2]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai Qi Zhi Inst, Shanghai, Peoples R China
[2] Ant Grp, Hangzhou, Peoples R China
Source
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS, ASPLOS 2024, VOL 2 | 2024
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
Memory Defragmentation; GPU; Deep Learning; Virtual Memory Stitching;
DOI
10.1145/3620665.3640423
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large-scale deep neural networks (DNNs), such as large language models (LLMs), have revolutionized the artificial intelligence (AI) field and become increasingly popular. However, training or fine-tuning such models requires substantial computational power and resources, and the memory capacity of a single acceleration device such as a GPU is one of the most important bottlenecks. Owing to the prohibitively large overhead (e.g., 10x) of the GPU's native memory allocator, DNN frameworks like PyTorch and TensorFlow adopt a caching allocator that maintains a memory pool with a splitting mechanism for fast memory (de)allocation. Unfortunately, the caching allocator's efficiency degrades quickly under popular memory-reduction techniques such as recomputation, offloading, distributed training, and low-rank adaptation. The primary reason is that these techniques introduce frequent and irregular memory (de)allocation requests, leading to severe fragmentation for the splitting-based caching allocator. To mitigate this fragmentation problem, we propose GPU memory lake (GMLake), a novel memory allocation framework built on low-level GPU virtual memory management. GMLake employs a novel virtual memory stitching (VMS) mechanism that fuses non-contiguous memory blocks through virtual memory address mapping. GMLake reduces GPU memory usage by an average of 9.2 GB (up to 25 GB) and fragmentation by an average of 15% (up to 33%) across eight LLM models on an A100 GPU with 80 GB of memory. GMLake is completely transparent to the DNN models and memory-reduction techniques, ensuring the seamless execution of resource-intensive deep-learning tasks. We have open-sourced GMLake at https://github.com/intelligent-machinelearning/glake/tree/main/GMLake.
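The virtual memory stitching (VMS) idea described in the abstract can be illustrated with the CUDA driver's documented low-level virtual memory management API (cuMemCreate, cuMemAddressReserve, cuMemMap, cuMemSetAccess). The following is a minimal sketch, not GMLake's actual allocator: it maps two physically separate chunks back-to-back into one reserved virtual address range, so the consumer sees a single contiguous buffer. The chunk count, device index, and file name (vms_sketch.cu) are illustrative assumptions.

    // Illustrative sketch (assumed file name: vms_sketch.cu) of virtual memory
    // stitching with the CUDA driver API; not the GMLake implementation.
    #include <cuda.h>
    #include <cstdio>

    #define CHECK(call)                                                      \
        do {                                                                 \
            CUresult r_ = (call);                                            \
            if (r_ != CUDA_SUCCESS) {                                        \
                const char *msg_;                                            \
                cuGetErrorString(r_, &msg_);                                 \
                fprintf(stderr, "%s failed: %s\n", #call, msg_);             \
                return 1;                                                    \
            }                                                                \
        } while (0)

    int main() {
        CHECK(cuInit(0));
        CUdevice dev;
        CHECK(cuDeviceGet(&dev, 0));
        CUcontext ctx;
        CHECK(cuCtxCreate(&ctx, 0, dev));

        // Physical memory is allocated as pinned device memory on GPU 0.
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = 0;

        // Physical chunks must be multiples of the allocation granularity.
        size_t gran = 0;
        CHECK(cuMemGetAllocationGranularity(&gran, &prop,
                                            CU_MEM_ALLOC_GRANULARITY_MINIMUM));
        size_t chunk = gran;      // one granule per physical block (illustrative)
        size_t total = 2 * chunk; // size of the stitched buffer

        // Two separate physical allocations, possibly far apart in device memory.
        CUmemGenericAllocationHandle h0, h1;
        CHECK(cuMemCreate(&h0, chunk, &prop, 0));
        CHECK(cuMemCreate(&h1, chunk, &prop, 0));

        // Reserve ONE contiguous virtual address range large enough for both.
        CUdeviceptr va = 0;
        CHECK(cuMemAddressReserve(&va, total, 0, 0, 0));

        // Map the two chunks back-to-back: the consumer now sees a single
        // contiguous buffer of size `total` starting at `va`.
        CHECK(cuMemMap(va, chunk, 0, h0, 0));
        CHECK(cuMemMap(va + chunk, chunk, 0, h1, 0));

        // Grant GPU 0 read/write access to the whole stitched range.
        CUmemAccessDesc access = {};
        access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        access.location.id = 0;
        access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        CHECK(cuMemSetAccess(va, total, &access, 1));

        // ... hand `va` to a kernel or tensor as one contiguous allocation ...

        // Teardown: unmap, release the physical handles, free the address range.
        CHECK(cuMemUnmap(va, total));
        CHECK(cuMemRelease(h0));
        CHECK(cuMemRelease(h1));
        CHECK(cuMemAddressFree(va, total));
        CHECK(cuCtxDestroy(ctx));
        printf("stitched %zu bytes from two non-contiguous physical blocks\n",
               total);
        return 0;
    }

Under these assumptions, the sketch compiles with "nvcc vms_sketch.cu -o vms_sketch -lcuda". Per the abstract, GMLake builds its transparent, fragmentation-aware allocator on top of such virtual-to-physical mappings; the pooling, caching, and bookkeeping policies are omitted here.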
Pages: 450 - 466
Number of pages: 17