A combined priority scheduling method for distributed machine learning

Cited by: 2

Authors:

Du, TianTian [1]

Xiao, GongYi [1]

Chen, Jing [1]

Zhang, ChuanFu [1]

Sun, Hao [1]

Li, Wen [1]

Geng, YuDong [1]

Affiliations:

[1] Qilu Univ Technol, Shandong Acad Sci, Shandong Comp Sci Ctr, Natl Supercomp Ctr Jinan, Shandong Prov Key Lab Com, Jinan, Peoples R China

Source:

EURASIP JOURNAL ON WIRELESS COMMUNICATIONS AND NETWORKING | Year 2023 / Volume 2023 / Issue 01

Keywords:

Cloud computing; Distributed machine learning; Resource scheduling; Prioritization; ALLOCATION;

DOI:

10.1186/s13638-023-02253-4

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification code:

0808 ; 0809 ;

Abstract:

Algorithms and frameworks for distributed machine learning are widely used in artificial intelligence engineering applications. Cloud platforms offer such applications a large pool of resources at lower cost and with greater convenience. With the rapid development of containerization, cloud-native stacks built on Docker and Kubernetes provide effective resource support for distributed machine learning. However, native Kubernetes offers no efficient priority-based or fair resource scheduling strategy for computationally intensive, long-running distributed machine learning jobs, which easily leads to resource deadlock, resource waste, and low job execution efficiency. To exploit both the execution order among multiple jobs and the dependencies among the tasks of a single job, a combined priority scheduling method for distributed machine learning is proposed, based on Kubernetes and Volcano, that accounts for intra- and inter-group scheduling priorities. The method builds a combined model of inter- and intra-job priority from the user priority, task priority, longest wait time, task parallelism, and the affinity and anti-affinity between the parameter server and worker nodes. This model is mapped onto a scheduling strategy of inter- and intra-group pod priorities, enabling efficient scheduling and training of distributed machine learning jobs. The experimental results show that the proposed method gives preferential resource allocation to urgent, high-parallelism, high-priority jobs from high-priority users and improves job execution efficiency. The affinity and anti-affinity settings among pods reduce, to a certain extent, the time spent exchanging information between the parameter server and worker nodes, thereby improving job completion efficiency. The group scheduling strategy alleviates the resource deadlock and waste caused by insufficient resources in cloud computing.
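The abstract names the factors that feed the combined priority but not how they are combined. The following is a minimal sketch of the general idea only, under stated assumptions: the `Job` fields, the `WEIGHTS` values, the weighted-sum formula, and the function names are all hypothetical and are not taken from the paper or from Volcano. It ranks jobs by a combined score and then admits each job's pod group on an all-or-nothing basis, which is the property that prevents a partially placed group from holding resources it cannot use.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    user_priority: float   # assumed to be normalized to [0, 1]
    task_priority: float   # assumed to be normalized to [0, 1]
    wait_seconds: float    # time the job has spent in the queue
    parallelism: int       # number of worker pods requested
    pods_needed: int       # parameter-server plus worker pods (the whole group)


# Hypothetical weights: the abstract does not say how the factors are combined.
WEIGHTS = {"user": 0.4, "task": 0.3, "wait": 0.2, "parallel": 0.1}


def combined_priority(job, max_wait, max_parallelism):
    """Weighted sum of the four factors; wait and parallelism are scaled to [0, 1]."""
    wait_term = job.wait_seconds / max_wait if max_wait else 0.0
    par_term = job.parallelism / max_parallelism if max_parallelism else 0.0
    return (WEIGHTS["user"] * job.user_priority
            + WEIGHTS["task"] * job.task_priority
            + WEIGHTS["wait"] * wait_term
            + WEIGHTS["parallel"] * par_term)


def gang_schedule(jobs, free_pods):
    """Admit whole pod groups in descending score order; never place a partial group."""
    max_wait = max((j.wait_seconds for j in jobs), default=0.0)
    max_par = max((j.parallelism for j in jobs), default=0)
    ordered = sorted(
        jobs,
        key=lambda j: combined_priority(j, max_wait, max_par),
        reverse=True,
    )
    admitted = []
    for job in ordered:
        # All-or-nothing: a group that does not fit entirely is skipped.
        if job.pods_needed <= free_pods:
            free_pods -= job.pods_needed
            admitted.append(job.name)
    return admitted, free_pods
```

In this sketch a lower-ranked job may be admitted ahead of a higher-ranked one that does not fit. That backfilling behaviour is a design choice of the example, not a claim about the paper's method.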

Pages: 24
