A Dynamic Self-Scheduling Scheme for Heterogeneous Multiprocessor Architectures

Cited by: 71
Authors
Belviranli, Mehmet E. [1 ]
Bhuyan, Laxmi N. [1 ]
Gupta, Rajiv [1 ]
Affiliations
[1] Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA
Funding
US National Science Foundation;
Keywords
Algorithms; Performance; Dynamic self-scheduling; workload balancing; GP-GPUs; FPGAs; GRAPHICS;
DOI
10.1145/2400682.2400716
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Discipline code
0812;
Abstract
Today's heterogeneous architectures bring together multiple general-purpose CPUs and multiple domain-specific GPUs and FPGAs to provide dramatic speedup for many applications. However, the challenge lies in utilizing these heterogeneous processors to optimize overall application performance by minimizing workload completion time. Operating system and application development for these systems are in their infancy. In this article, we propose a new scheduling and workload balancing scheme, HDSS, for execution of loops having dependent or independent iterations on heterogeneous multiprocessor systems. The new algorithm dynamically learns the computational power of each processor during an adaptive phase and then schedules the remainder of the workload using a weighted self-scheduling scheme during the completion phase. Different from previous studies, our scheme uniquely considers the runtime effects of block sizes on the performance of heterogeneous multiprocessors. It finds the right trade-off between large and small block sizes to maintain a balanced workload while keeping accelerator utilization at its maximum. Our algorithm does not require offline training or architecture-specific parameters. We have evaluated our scheme on two different heterogeneous architectures: an AMD 64-core Bulldozer system with an nVidia Fermi C2050 GPU, and an Intel Xeon 32-core SGI Altix 4700 supercomputer with Xilinx Virtex 4 FPGAs. The experimental results show that our new scheduling algorithm can achieve performance improvements of over 200% when compared to the closest existing load balancing scheme. Our algorithm also achieves full processor utilization, with all processors completing at nearly the same time, which is significantly better than alternative current approaches.
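The two-phase idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation; processor speeds are simulated, and the chunk-halving rule, `min_block` floor, and function names are illustrative assumptions. The adaptive phase estimates each processor's throughput from a few equal-sized trial blocks; the completion phase then hands out chunks proportional to those weights, shrinking chunk sizes so that all processors finish at nearly the same time.

```python
def adaptive_phase(speeds, block=64, rounds=3):
    """Estimate per-processor throughput by 'timing' a few equal blocks.
    Timing is simulated here: elapsed = block / speed. Returns normalized
    weights (each processor's share of total throughput)."""
    raw = []
    for s in speeds:
        elapsed = sum(block / s for _ in range(rounds))
        raw.append((rounds * block) / elapsed)  # iterations per unit time
    total = sum(raw)
    return [r / total for r in raw]

def completion_phase(remaining, weights, min_block=8):
    """Weighted self-scheduling: each processor repeatedly grabs a chunk
    proportional to its weight and to the remaining work (halved each
    grab, guided-scheduling style), down to a minimum block size."""
    schedule = [[] for _ in weights]  # chunks assigned to each processor
    left = remaining
    while left > 0:
        for i, w in enumerate(weights):
            if left == 0:
                break
            chunk = min(left, max(min_block, int(left * w * 0.5)))
            schedule[i].append(chunk)
            left -= chunk
    return schedule

# Example: one slow CPU core vs. one 4x-faster accelerator.
weights = adaptive_phase([1.0, 4.0])          # -> [0.2, 0.8]
schedule = completion_phase(1000, weights)    # faster device gets more work
```

The halving factor mimics the trade-off the abstract mentions: early chunks are large to keep the accelerator fully utilized, while later chunks are small so no processor is left holding excess work at the end.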
Pages: 20