Model Parameter Prediction Method for Accelerating Distributed DNN Training

Cited by: 0
Authors
Liu, Wai-xi [1]
Chen, Dao-xiao [3]
Tan, Miao-quan [3]
Chen, Kong-yang [4]
Yin, Yue [3]
Shang, Wen-Li [3]
Li, Jin [4]
Cai, Jun [2]
Affiliations
[1] Guangzhou Univ, Dept Elect & Commun Engn, Guangzhou, Peoples R China
[2] Guangdong Polytech Normal Univ, Guangzhou, Peoples R China
[3] Guangzhou Univ, Guangzhou, Peoples R China
[4] Guangzhou Univ, Inst Artificial Intelligence, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Distributed training; Communication optimization; Parameter prediction; Communication
DOI
10.1016/j.comnet.2024.110883
Chinese Library Classification
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
As the size of deep neural network (DNN) models and datasets increases, distributed training has become a popular way to reduce training time. However, a severe communication bottleneck in distributed training limits its scalability. Many methods, such as gradient sparsification and quantization, aim to relieve this bottleneck by reducing communication traffic, but they either sacrifice model accuracy or introduce substantial computing overhead. We have observed that the data distributions across the layers of a neural network model are similar. We therefore propose a model parameter prediction method (MP2) to accelerate distributed DNN training under the parameter server (PS) framework: workers push only a subset of model parameters to the PS, and the remaining parameters are predicted locally on the PS by an already-trained deep neural network model. We address several key challenges in this approach. First, we build a hierarchical parameter dataset by randomly sampling a subset of model parameters from normal distributed training runs. Second, we design a neural network model with a "convolution + channel attention + max pooling" structure to predict model parameters, using a prediction-result-based evaluation method. For VGGNet, ResNet, and AlexNet models on the CIFAR10 and CIFAR100 datasets, compared with the baseline, Top-k, deep gradient compression (DGC), and weight nowcaster network (WNN), MP2 reduces traffic by up to 88.98% and accelerates training by up to 47.32% without losing model accuracy. MP2 also shows good generalization.
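The abstract describes a predictor on the PS built from "convolution + channel attention + max pooling" that fills in the parameters workers did not push. Below is a minimal PyTorch sketch of what such a predictor could look like; the class names (ChannelAttention, ParameterPredictor), layer sizes, channel counts, and the framing of the task as mapping a flattened sampled-parameter vector to a predicted parameter vector are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a "convolution + channel attention + max pooling"
# parameter predictor. All dimensions and the input/output framing are
# illustrative assumptions, not the paper's actual design.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed variant)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),          # squeeze: global average per channel
            nn.Conv1d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                     # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)               # reweight channels


class ParameterPredictor(nn.Module):
    """Predicts a flattened block of model parameters from a sampled subset."""

    def __init__(self, in_len: int = 1024, out_len: int = 1024, channels: int = 16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            ChannelAttention(channels),
            nn.MaxPool1d(kernel_size=2),      # "convolution + channel attention + max pooling"
        )
        self.head = nn.Linear(channels * (in_len // 2), out_len)

    def forward(self, sampled_params: torch.Tensor) -> torch.Tensor:
        # sampled_params: (batch, in_len) flattened subset pushed by a worker
        x = self.features(sampled_params.unsqueeze(1))
        return self.head(x.flatten(1))        # (batch, out_len) predicted parameters


if __name__ == "__main__":
    model = ParameterPredictor()
    subset = torch.randn(8, 1024)             # toy batch of sampled parameter vectors
    predicted = model(subset)
    print(predicted.shape)                     # torch.Size([8, 1024])
```

In the paper's setting, the PS would run a predictor of this kind to reconstruct the unsent parameters; the squeeze-and-excitation-style gate shown here is one common way to realize channel attention, chosen only for illustration.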
Pages: 15