MD-Roofline: A Training Performance Analysis Model for Distributed Deep Learning

Cited by: 2
Authors:
Miao, Tianhao [1 ,2 ]
Wu, Qinghua [1 ,4 ]
Liu, Ting [1 ,2 ]
Cui, Penglai [1 ,2 ]
Ren, Rui [1 ,2 ]
Li, Zhenyu [1 ,4 ]
Xie, Gaogang [3 ]
Affiliations:
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Chinese Acad Sci, Comp Network Informat Ctr, Beijing, Peoples R China
[4] Purple Mt Labs, Nanjing, Peoples R China
Source:
2022 27TH IEEE SYMPOSIUM ON COMPUTERS AND COMMUNICATIONS (IEEE ISCC 2022), 2022
Fund:
National Natural Science Foundation of China
Keywords:
Distributed Training Performance; Straggler Diagnosis; Bottleneck Location; Roofline; OPERATIONS
DOI
10.1109/ISCC55528.2022.9912757
CLC Classification:
TP [Automation Technology, Computer Technology]
Subject Classification Code:
0812
Abstract:
The bulkiness and sophistication of Distributed Deep Learning (DDL) systems make it enormously challenging for AI researchers and operations engineers to analyze, diagnose, and locate performance bottlenecks during the training stage. Existing performance models and frameworks offer little insight into the performance degradation that a straggler induces. In this paper, we introduce MD-Roofline, a training performance analysis model that extends the traditional roofline model with a communication dimension. The model considers layer-wise attributes at the application level and a series of achievable peak performance metrics at the hardware level. With the assistance of MD-Roofline, AI researchers and DDL operations engineers can locate the system bottleneck along three dimensions: intra-GPU computation capacity, intra-GPU memory access bandwidth, and inter-GPU communication bandwidth. We demonstrate that our performance analysis model provides great insight into bottleneck analysis when training 12 classic CNNs.
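The paper itself gives the exact formulation; as a rough, hypothetical sketch of the underlying idea (not the authors' model), a classical roofline bounds attainable throughput by min(peak compute, arithmetic intensity × memory bandwidth), and a communication dimension adds a third roof derived from inter-GPU traffic. The function name, parameters, and all numbers below are illustrative assumptions.

# Hypothetical sketch of a roofline extended with a communication roof;
# names and numbers are illustrative, not taken from the paper.
def md_roofline_bound(flops, mem_bytes, comm_bytes,
                      peak_flops, mem_bw, comm_bw):
    """Upper-bound the attainable FLOP/s of one layer along three roofs."""
    arithmetic_intensity = flops / mem_bytes      # FLOPs per GPU-memory byte
    communication_intensity = flops / comm_bytes  # FLOPs per inter-GPU byte
    return min(peak_flops,                         # compute roof
               arithmetic_intensity * mem_bw,      # memory-bandwidth roof
               communication_intensity * comm_bw)  # communication roof

# Example with made-up numbers: a layer performing 2 GFLOPs, moving 40 MB
# through GPU memory, and exchanging 10 MB of gradients per step.
bound = md_roofline_bound(flops=2e9, mem_bytes=4e7, comm_bytes=1e7,
                          peak_flops=15e12, mem_bw=9e11, comm_bw=1.25e10)
print(f"attainable upper bound: {bound / 1e12:.2f} TFLOP/s")

With these made-up numbers the communication roof binds (2.50 TFLOP/s against a 15 TFLOP/s compute peak), which is precisely the kind of inter-GPU bottleneck a three-dimensional roofline is meant to expose.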
Pages: 8
Related Papers
50 records in total
  • [41] Early Experiences of Noise-Sensitivity Performance Analysis of a Distributed Deep Learning Framework
    Rojas, Elvis
    Knobloch, Michael
    Daoud, Nour
    Meneses, Esteban
    Mohr, Bernd
    2022 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2022), 2022, : 516 - 522
  • [42] Adaptive Distributed Parallel Training Method for a Deep Learning Model Based on Dynamic Critical Paths of DAG
    Zeng, Yan
    Wang, Wei
    Ding, Yong
    Zhang, Jilin
    Ren, Yongjian
    Yi, Guangzheng
    MATHEMATICS, 2022, 10 (24)
  • [43] FPGA-based tsunami simulation: Performance comparison with GPUs, and roofline model for scalability analysis
    Nagasu, Kohei
    Sano, Kentaro
    Kono, Fumiya
    Nakasato, Naohito
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2017, 106 : 153 - 169
  • [44] COMPARISON OF DEEP LEARNING MODEL PERFORMANCE BETWEEN META-DATASET TRAINING VERSUS DEEP NEURAL ENSEMBLES
    Hurt, J. Alex
    Scott, Grant J.
    Davis, Curt H.
    2019 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2019), 2019, : 1326 - 1329
  • [46] Interpreting a deep reinforcement learning model with conceptual embedding and performance analysis
    Dai, Yinglong
    Ouyang, Haibin
    Zheng, Hong
    Long, Han
    Duan, Xiaojun
    APPLIED INTELLIGENCE, 2023, 53 (06) : 6936 - 6952
  • [47] Classification of Continuous ECG Segments - Performance Analysis of a Deep Learning Model
    Barbosa, Luis C. N.
    Lopes, Diogo
    Escrivaes, Ines
    Moreira, Antonio H. J.
    Carvalho, Vitor
    Vilaca, Joao L.
    Morais, Pedro
2023 45TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE & BIOLOGY SOCIETY, EMBC, 2023
  • [48] Distributed Framework for Accelerating Training of Deep Learning Models through Prioritization
    Zhou, Tian
    Gao, Lixin
    2021 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING, IC2E 2021, 2021, : 201 - 209
  • [49] Efficient Flow Scheduling in Distributed Deep Learning Training with Echelon Formation
    Pan, Rui
    Lei, Yiming
    Li, Jialong
    Xie, Zhiqiang
    Yuan, Binhang
    Xia, Yiting
    THE 21ST ACM WORKSHOP ON HOT TOPICS IN NETWORKS, HOTNETS 2022, 2022, : 93 - 100
  • [50] Modeling the Training Iteration Time for Heterogeneous Distributed Deep Learning Systems
    Zeng, Yifu
    Chen, Bowei
    Pan, Pulin
    Li, Kenli
    Chen, Guo
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2023, 2023