MD-Roofline: A Training Performance Analysis Model for Distributed Deep Learning

Cited by: 2
Authors:
Miao, Tianhao [1 ,2 ]
Wu, Qinghua [1 ,4 ]
Liu, Ting [1 ,2 ]
Cui, Penglai [1 ,2 ]
Ren, Rui [1 ,2 ]
Li, Zhenyu [1 ,4 ]
Xie, Gaogang [3 ]
Affiliations:
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Chinese Acad Sci, Comp Network Informat Ctr, Beijing, Peoples R China
[4] Purple Mt Labs, Nanjing, Peoples R China
Source:
2022 27TH IEEE SYMPOSIUM ON COMPUTERS AND COMMUNICATIONS (IEEE ISCC 2022), 2022
Fund:
National Natural Science Foundation of China
Keywords:
Distributed Training Performance; Straggler Diagnosis; Bottleneck Location; Roofline; OPERATIONS
DOI
10.1109/ISCC55528.2022.9912757
CLC Classification:
TP [Automation Technology, Computer Technology]
Subject Classification Code:
0812
Abstract:
The bulkiness and sophistication of Distributed Deep Learning (DDL) systems make it enormously challenging for AI researchers and operations engineers to analyze, diagnose, and locate performance bottlenecks during the training stage. Existing performance models and frameworks offer little insight into the performance degradation that a straggler induces. In this paper, we introduce MD-Roofline, a training performance analysis model that extends the traditional roofline model with a communication dimension. The model considers layer-wise attributes at the application level and a series of achievable peak performance metrics at the hardware level. With the assistance of MD-Roofline, AI researchers and DDL operations engineers can locate the system bottleneck along three dimensions: intra-GPU computation capacity, intra-GPU memory access bandwidth, and inter-GPU communication bandwidth. We demonstrate that our performance analysis model provides great insight into bottleneck analysis when training 12 classic CNNs.
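The paper itself gives the exact formulation; as a rough, hypothetical sketch of the underlying idea (not the authors' model), a classical roofline bounds attainable throughput by min(peak compute, arithmetic intensity × memory bandwidth), and a communication dimension adds a third roof derived from inter-GPU traffic. The function name, parameters, and all numbers below are illustrative assumptions.

# Hypothetical sketch of a roofline extended with a communication roof;
# names and numbers are illustrative, not taken from the paper.
def md_roofline_bound(flops, mem_bytes, comm_bytes,
                      peak_flops, mem_bw, comm_bw):
    """Upper-bound the attainable FLOP/s of one layer along three roofs."""
    arithmetic_intensity = flops / mem_bytes      # FLOPs per GPU-memory byte
    communication_intensity = flops / comm_bytes  # FLOPs per inter-GPU byte
    return min(peak_flops,                         # compute roof
               arithmetic_intensity * mem_bw,      # memory-bandwidth roof
               communication_intensity * comm_bw)  # communication roof

# Example with made-up numbers: a layer performing 2 GFLOPs, moving 40 MB
# through GPU memory, and exchanging 10 MB of gradients per step.
bound = md_roofline_bound(flops=2e9, mem_bytes=4e7, comm_bytes=1e7,
                          peak_flops=15e12, mem_bw=9e11, comm_bw=1.25e10)
print(f"attainable upper bound: {bound / 1e12:.2f} TFLOP/s")

With these made-up numbers the communication roof binds (2.50 TFLOP/s against a 15 TFLOP/s compute peak), which is precisely the kind of inter-GPU bottleneck a three-dimensional roofline is meant to expose.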
Pages: 8
Related Papers
50 records in total
  • [41] Early Experiences of Noise-Sensitivity Performance Analysis of a Distributed Deep Learning Framework
    Rojas, Elvis
    Knobloch, Michael
    Daoud, Nour
    Meneses, Esteban
    Mohr, Bernd
    2022 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2022), 2022, : 516 - 522
  • [42] Adaptive Distributed Parallel Training Method for a Deep Learning Model Based on Dynamic Critical Paths of DAG
    Zeng, Yan
    Wang, Wei
    Ding, Yong
    Zhang, Jilin
    Ren, Yongjian
    Yi, Guangzheng
    MATHEMATICS, 2022, 10 (24)
  • [43] FPGA-based tsunami simulation: Performance comparison with GPUs, and roofline model for scalability analysis
    Nagasu, Kohei
    Sano, Kentaro
    Kono, Fumiya
    Nakasato, Naohito
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2017, 106 : 153 - 169
  • [44] COMPARISON OF DEEP LEARNING MODEL PERFORMANCE BETWEEN META-DATASET TRAINING VERSUS DEEP NEURAL ENSEMBLES
    Hurt, J. Alex
    Scott, Grant J.
    Davis, Curt H.
    2019 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2019), 2019, : 1326 - 1329
  • [46] Interpreting a deep reinforcement learning model with conceptual embedding and performance analysis
    Dai, Yinglong
    Ouyang, Haibin
    Zheng, Hong
    Long, Han
    Duan, Xiaojun
    APPLIED INTELLIGENCE, 2023, 53 (06) : 6936 - 6952
  • [47] Classification of Continuous ECG Segments - Performance Analysis of a Deep Learning Model
    Barbosa, Luis C. N.
    Lopes, Diogo
    Escrivaes, Ines
    Moreira, Antonio H. J.
    Carvalho, Vitor
    Vilaca, Joao L.
    Morais, Pedro
2023 45TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE & BIOLOGY SOCIETY, EMBC, 2023
  • [48] Distributed Framework for Accelerating Training of Deep Learning Models through Prioritization
    Zhou, Tian
    Gao, Lixin
    2021 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING, IC2E 2021, 2021, : 201 - 209
  • [49] Efficient Flow Scheduling in Distributed Deep Learning Training with Echelon Formation
    Pan, Rui
    Lei, Yiming
    Li, Jialong
    Xie, Zhiqiang
    Yuan, Binhang
    Xia, Yiting
    THE 21ST ACM WORKSHOP ON HOT TOPICS IN NETWORKS, HOTNETS 2022, 2022, : 93 - 100
  • [50] Modeling the Training Iteration Time for Heterogeneous Distributed Deep Learning Systems
    Zeng, Yifu
    Chen, Bowei
    Pan, Pulin
    Li, Kenli
    Chen, Guo
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2023, 2023