Fault-tolerant parallel scheduling of tasks on a heterogeneous high-performance workstation cluster

被引:2
作者
Kwok, YK [1 ]
机构
[1] Univ Hong Kong, Dept Elect & Elect Engn, Hong Kong, Hong Kong, Peoples R China
关键词
parallel algorithms; cluster computing; heterogeneous systems; fault-tolerant scheduler; task graphs; neighborhood search;
D O I
10.1023/A:1011186732749
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
We propose a new approach, called cluster-based search (CBS), for scheduling large task graphs in parallel on a heterogeneous cluster of workstations connected by a high-speed network (e.g., using an ATM switch at OC-3 speed). The CBS algorithm uses a parallel random neighborhood search which works by refining multiple different initial schedules simultaneously using different workstations. The workstations communicate periodically to exchange their best solutions found thus far in order to direct the search to more promising regions in the search space. Heterogeneity of machines is exploited by the biased partitioning of the search space. The parallel random neighborhood search is fault-tolerant in that the workload of a failed workstation is automatically redistributed to other workstations so that the search can continue. We have implemented the CBS algorithm as a core function of our on-going development of SSI middleware for a Sun workstation cluster.
引用
收藏
页码:299 / 314
页数:16
相关论文
共 46 条
[11]   Design, implementation and performance of fault-tolerant message passing interface (MPI) [J].
Selvakumar, AD ;
Sobha, PM ;
Ravindra, GC ;
Pitchiah, R .
PARALLEL AND DISTRIBUTED COMPUTING SYSTEMS, 2004, :145-150
[12]   Design, implementation and performance of Fault-Tolerant message passing interface (MPI) [J].
Selvakumar, AD ;
Sobha, PM ;
Ravindra, GC ;
Pitchiah, R .
SEVENTH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND GRID IN ASIA PACIFIC REGION, PROCEEDINGS, 2004, :120-129
[13]   FAULT-TOLERANT PARALLEL K-SELECTION ALGORITHM IN N-CUBE NETWORKS [J].
SHEU, JP .
INFORMATION PROCESSING LETTERS, 1991, 39 (02) :93-97
[14]   Fault-tolerant parallel algorithms for adaptive matched-field processing on distributed array systems [J].
Cho, K ;
George, AD ;
Subramaniyan, R .
JOURNAL OF COMPUTATIONAL ACOUSTICS, 2005, 13 (04) :667-687
[15]   High-Performance Broadcasting Algorithms on Cluster [J].
舒继武 ;
魏英霞 ;
王鼎兴 .
Tsinghua Science and Technology, 2004, (01) :30-37
[16]   An efficient adaptive scheduling policy for high-performance computing [J].
Abawajy, J. H. .
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2009, 25 (03) :364-370
[18]   Performance-based parallel application toolkit for high-performance clusters [J].
Li, Kuan-Ching ;
Weng, Tien-Hsiung .
JOURNAL OF SUPERCOMPUTING, 2009, 48 (01) :43-65
[19]   Performance-based parallel application toolkit for high-performance clusters [J].
Kuan-Ching Li ;
Tien-Hsiung Weng .
The Journal of Supercomputing, 2009, 48 :43-65
[20]   High-Performance and Flexible Parallel Algorithms for Semisort and Related Problems [J].
Dong, Xiaojun ;
Wu, Yunshu ;
Wang, Zhongqi ;
Dhulipala, Laxman ;
Gu, Yan ;
Sun, Yihan .
PROCEEDINGS OF THE 35TH ACM SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES, SPAA 2023, 2023, :341-353