Alibaba HPN: A Data Center Network for Large Language Model Training

Cited by: 2
Authors
Qian, Kun [1 ]
Xi, Yongqing [1 ]
Cao, Jiamin [1 ]
Gao, Jiaqi [1 ]
Xu, Yichi [1 ]
Guan, Yu [1 ]
Fu, Binzhang [1 ]
Shi, Xuemei [1 ]
Zhu, Fangbo [1 ]
Miao, Rui [1 ]
Wang, Chao [1 ]
Wang, Peng [1 ]
Zhang, Pengcheng [1 ]
Zeng, Xianlong [1 ]
Ruan, Eddie [1 ]
Yao, Zhiping [1 ]
Zhai, Ennan [1 ]
Cai, Dennis [1 ]
Affiliations
[1] Alibaba Cloud, Hangzhou, People's Republic of China
Source
Proceedings of the ACM SIGCOMM 2024 Conference | 2024
Keywords
Network Architecture; AI Infrastructure; Large Language Model; Model Training; Data Center Networks;
DOI
10.1145/3651890.3672265
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents HPN, Alibaba Cloud's data center network for large language model (LLM) training. Because LLM workloads differ from general cloud computing (e.g., in traffic patterns and fault tolerance), traditional data center networks are not well suited to LLM training. LLM training produces a small number of periodic, bursty flows (e.g., 400 Gbps) on each host, which predisposes Equal-Cost Multi-Path (ECMP) routing to hash polarization, causing issues such as uneven traffic distribution. HPN introduces a 2-tier, dual-plane architecture capable of interconnecting 15K GPUs within one Pod, a scale that traditionally requires a 3-tier Clos architecture. This new design not only avoids hash polarization but also greatly reduces the search space for path selection. Another challenge in LLM training is that requiring GPUs to complete iterations in synchronization makes training more sensitive to single-point failures (typically occurring at the ToR). HPN therefore replaces the single ToR of traditional data center networks with a new dual-ToR design. HPN has been deployed in our production environment for more than eight months. We share our experience in designing and building HPN, as well as operational lessons from running it in production.
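The abstract's ECMP argument can be illustrated with a minimal sketch (not from the paper; the hash function, path count, and flow tuples are illustrative assumptions). ECMP hashes each flow's 5-tuple onto one of N equal-cost paths; with only a handful of large flows per host, the hash can easily map several flows onto the same path, leaving other paths idle, whereas thousands of small flows average out.

```python
# Illustrative sketch of ECMP load imbalance with few large flows.
# NUM_PATHS, the md5-based hash, and the flow tuples are assumptions
# for demonstration, not details from the HPN paper.
import hashlib
import random

NUM_PATHS = 8

def ecmp_path(flow, num_paths=NUM_PATHS):
    """Pick a path by hashing the flow's 5-tuple, as ECMP does."""
    digest = hashlib.md5(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

def load_distribution(flows):
    """Count how many flows land on each of the NUM_PATHS paths."""
    loads = [0] * NUM_PATHS
    for f in flows:
        loads[ecmp_path(f)] += 1
    return loads

random.seed(0)

def rand_flow():
    # (src IP, dst IP, src port, dst port, protocol); 4791 is the
    # RoCEv2 UDP port, a plausible choice for GPU-cluster traffic.
    return ("10.0.0.1", "10.0.1.1",
            random.randint(1024, 65535), 4791, "UDP")

few = [rand_flow() for _ in range(4)]      # LLM training: few bursty flows
many = [rand_flow() for _ in range(4000)]  # general cloud: many small flows

print("4 flows over 8 paths:   ", load_distribution(few))
print("4000 flows over 8 paths:", load_distribution(many))
```

With four flows, at most half of the eight paths can carry any traffic at all, so some links sit idle while others carry multiple 400 Gbps flows; with thousands of flows, the per-path counts converge toward the mean. This is the imbalance that HPN's 2-tier dual-plane design sidesteps by shrinking the path-selection space.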
Pages: 691-706
Page count: 16