Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation

Cited by: 0
Authors
Liu, Banruo [1, 2]
Ojewale, Mubarak Adetunji [2 ]
Ding, Yuhan [1 ]
Canini, Marco [2 ]
Affiliations
[1] Tsinghua University, Beijing, People's Republic of China
[2] KAUST, Thuwal, Saudi Arabia
Source
PROCEEDINGS OF THE 15TH ACM SIGOPS ASIA-PACIFIC WORKSHOP ON SYSTEMS, APSYS 2024 | 2024
Keywords
Distributed Deep Learning Training; Machine Learning Systems; DNN Training Emulation
DOI
10.1145/3678015.3680478
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose NeuronaBox, a flexible, user-friendly, and high-fidelity approach to emulate DNN training workloads. We argue that to accurately observe performance, it is possible to execute the training workload on a subset of real nodes and emulate the networked execution environment and the collective communication operations. Initial results from a proof-of-concept implementation show that NeuronaBox replicates the behavior of actual systems with high accuracy, with an error margin of less than 1% between the emulated measurements and the real system.
Pages: 88-94
Page count: 7