Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning

Cited by: 2
Authors
An, Wei [1 ]
Bi, Xiao [1 ]
Chen, Guanting [1 ]
Chen, Shanhuang [1 ]
Deng, Chengqi [1 ]
Ding, Honghui [1 ]
Dong, Kai [1 ]
Du, Qiushi [1 ]
Gao, Wenjun [1 ]
Guan, Kang [1 ]
Guo, Jianzhong [1 ]
Guo, Yongqiang [1]
Fu, Zhe [1 ]
He, Ying [1 ]
Huang, Panpan [1 ]
Li, Jiashi [1 ]
Liang, Wenfeng [1 ]
Liu, Xiaodong [1 ]
Liu, Xin [1 ]
Liu, Yiyuan [1 ]
Liu, Yuxuan [1 ]
Lu, Shanghao [1 ]
Lu, Xuan [1 ]
Nie, Xiaotao [1 ]
Pei, Tian [1 ]
Qiu, Junjie [1 ]
Qu, Hui [1 ]
Ren, Zehui [1 ]
Sha, Zhangli [1 ]
Su, Xuecheng [1 ]
Sun, Xiaowen [1 ]
Tan, Yixuan [1 ]
Tang, Minghui [1 ]
Wang, Shiyu [1 ]
Wang, Yaohui [1 ]
Wang, Yongji [1 ]
Xie, Ziwei [1 ]
Xiong, Yiliang [1 ]
Xu, Yanhong [1 ]
Ye, Shengfeng [1 ]
Yu, Shuiping [1 ]
Zha, Yukun [1 ]
Zhang, Liyue [1 ]
Zhang, Haowei [1 ]
Zhang, Mingchuan [1 ]
Zhang, Wentao [1 ]
Zhang, Yichao [1 ]
Zhao, Chenggang [1 ]
Zhao, Yao [1 ]
Zhou, Shangyan [1 ]
Affiliations
[1] DeepSeek AI, Beijing, People's Republic of China
Source
SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, 2024
Keywords
High Performance Computing; Cost-Effective; All-Reduce; Best Practices; Deep Learning; Machine Learning; Large Language Models; Artificial Intelligence Infrastructure
DOI
10.1109/SC41406.2024.00089
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased the demand for computational power and bandwidth. This, combined with the high cost of faster computing chips and interconnects, has significantly inflated the cost of building High Performance Computing (HPC) systems. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework, together with its best practices. For DL training, we deployed Fire-Flyer 2, a cluster of 10,000 PCIe A100 GPUs, achieving performance approximating that of the NVIDIA DGX-A100 at half the cost and with 40% lower energy consumption. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and the HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC.
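
The abstract's scalability claim rests on overlapping computation with allreduce communication. The following minimal Python sketch is illustrative only, not DeepSeek's actual HFReduce: the bucket sizes, the gloo backend, and the single-process setup are assumptions chosen so the example runs anywhere. It shows the generic pattern of launching each gradient bucket's allreduce asynchronously so the reduction proceeds in the background while computation on the next bucket continues.

import os
import torch
import torch.distributed as dist

def overlapped_allreduce_demo():
    # Single-process "cluster" so the sketch runs anywhere; a real job
    # would launch one rank per GPU (e.g. via torchrun) with nccl.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    # Stand-ins for gradient buckets produced during the backward pass.
    buckets = [torch.randn(1 << 20) for _ in range(4)]

    handles = []
    for grad in buckets:
        # async_op=True returns a work handle immediately; the reduction
        # runs in the background while later buckets are still computed.
        handles.append(dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True))
        # ... backward computation for the next bucket would overlap here ...

    # Block only after all reductions have been launched, right before
    # the optimizer step would consume the reduced gradients.
    for h in handles:
        h.wait()

    dist.destroy_process_group()

if __name__ == "__main__":
    overlapped_allreduce_demo()

In a real multi-GPU job, each rank would drive one GPU and the waits would sit just before the optimizer step, hiding communication latency behind the remaining backward computation.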
Pages: 23