Alibaba HPN: A Data Center Network for Large Language Model Training

Cited by: 2
Authors
Qian, Kun [1 ]
Xi, Yongqing [1 ]
Cao, Jiamin [1 ]
Gao, Jiaqi [1 ]
Xu, Yichi [1 ]
Guan, Yu [1 ]
Fu, Binzhang [1 ]
Shi, Xuemei [1 ]
Zhu, Fangbo [1 ]
Miao, Rui [1 ]
Wang, Chao [1 ]
Wang, Peng [1 ]
Zhang, Pengcheng [1 ]
Zeng, Xianlong [1 ]
Ruan, Eddie [1 ]
Yao, Zhiping [1 ]
Zhai, Ennan [1 ]
Cai, Dennis [1 ]
Affiliations
[1] Alibaba Cloud, Hangzhou, People's Republic of China
Source
Proceedings of the ACM SIGCOMM 2024 Conference | 2024
Keywords
Network Architecture; AI Infrastructure; Large Language Model; Model Training; Data Center Networks;
DOI
10.1145/3651890.3672265
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents HPN, Alibaba Cloud's data center network for large language model (LLM) training. Because LLM workloads differ from general cloud computing (e.g., in traffic patterns and fault tolerance), traditional data center networks are not well suited to LLM training. LLM training produces a small number of periodic, bursty flows (e.g., 400 Gbps) on each host, which predisposes Equal-Cost Multi-Path (ECMP) routing to hash polarization, causing issues such as uneven traffic distribution. HPN introduces a 2-tier, dual-plane architecture capable of interconnecting 15K GPUs within one Pod, a scale that traditionally requires a 3-tier Clos architecture. This new design not only avoids hash polarization but also greatly reduces the search space for path selection. Another challenge in LLM training is that requiring GPUs to complete iterations in synchronization makes training more sensitive to single-point failures (typically occurring at the ToR). HPN therefore replaces the single ToR of traditional data center networks with a new dual-ToR design. HPN has been deployed in our production environment for more than eight months. We share our experience in designing and building HPN, as well as operational lessons from running it in production.
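The abstract's ECMP argument can be illustrated with a minimal sketch (not from the paper; the hash function, path count, and flow tuples are illustrative assumptions). ECMP hashes each flow's 5-tuple onto one of N equal-cost paths; with only a handful of large flows per host, the hash can easily map several flows onto the same path, leaving other paths idle, whereas thousands of small flows average out.

```python
# Illustrative sketch of ECMP load imbalance with few large flows.
# NUM_PATHS, the md5-based hash, and the flow tuples are assumptions
# for demonstration, not details from the HPN paper.
import hashlib
import random

NUM_PATHS = 8

def ecmp_path(flow, num_paths=NUM_PATHS):
    """Pick a path by hashing the flow's 5-tuple, as ECMP does."""
    digest = hashlib.md5(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

def load_distribution(flows):
    """Count how many flows land on each of the NUM_PATHS paths."""
    loads = [0] * NUM_PATHS
    for f in flows:
        loads[ecmp_path(f)] += 1
    return loads

random.seed(0)

def rand_flow():
    # (src IP, dst IP, src port, dst port, protocol); 4791 is the
    # RoCEv2 UDP port, a plausible choice for GPU-cluster traffic.
    return ("10.0.0.1", "10.0.1.1",
            random.randint(1024, 65535), 4791, "UDP")

few = [rand_flow() for _ in range(4)]      # LLM training: few bursty flows
many = [rand_flow() for _ in range(4000)]  # general cloud: many small flows

print("4 flows over 8 paths:   ", load_distribution(few))
print("4000 flows over 8 paths:", load_distribution(many))
```

With four flows, at most half of the eight paths can carry any traffic at all, so some links sit idle while others carry multiple 400 Gbps flows; with thousands of flows, the per-path counts converge toward the mean. This is the imbalance that HPN's 2-tier dual-plane design sidesteps by shrinking the path-selection space.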
Pages: 691-706
Page count: 16