EDDIS: Accelerating Distributed Data-Parallel DNN Training for Heterogeneous GPU Cluster

Cited by: 0
Authors
Ahn, Shinyoung [1]
Ahn, Hooyoung [1]
Choi, Hyeonseong [2]
Lee, Jaehyun [3]
Affiliations
[1] ETRI, Supercomp Syst Res Sect, Daejeon, South Korea
[2] MangoBoost, Software Team, Seoul, South Korea
[3] Puzzle AI Inc, Res Inst, Daejeon, South Korea
Source
2024 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, IPDPSW 2024 | 2024
Keywords
Deep Neural Network; DNN; Heterogeneous GPU; Distributed Training; Data-parallel Training; EDDIS; SoftMemoryBox; SSGD; EASGD; Hybrid SGD
DOI
10.1109/IPDPSW63119.2024.00194
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
EDDIS is a novel distributed deep learning library designed to efficiently utilize heterogeneous GPU resources for training deep neural networks (DNNs), addressing scalability and communication challenges in distributed training environments. It offers three training modes (synchronous, asynchronous, and hybrid) and supports the TensorFlow and PyTorch frameworks. EDDIS significantly accelerates DNN training in heterogeneous GPU settings, achieving up to 17.5x faster training with 16 nodes than with a single node. The hybrid training mode also surpasses Horovod, training the ResNet50 model 2.53 times faster.
Pages: 1167-1168
Page count: 2