Distributed Continuation Stealing is More Scalable than You Might Think

Cited by: 2
Authors
Shiina, Shumpei [1 ]
Taura, Kenjiro [1 ]
Affiliation
[1] Univ Tokyo, Tokyo, Japan
Source
2022 IEEE International Conference on Cluster Computing (CLUSTER 2022) | 2022
Keywords
work stealing; fork-join parallelism; futures; work-first scheduling; task-based runtime systems; RDMA; CILK;
DOI
10.1109/CLUSTER51413.2022.00027
CLC Number
TP3 [Computing Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
The need for load balancing in applications with irregular parallelism has motivated research on work stealing. An important choice in work-stealing schedulers is between child stealing and continuation stealing. In child stealing, a newly created task is made stealable by other processors, whereas in continuation stealing, the caller's continuation is made stealable by executing the newly created task first, which preserves the serial execution order. Although the benefits of continuation stealing have been demonstrated on shared memory by Cilk and other runtime systems, it is rarely employed on distributed memory, presumably because it has been thought to be difficult to implement and inefficient, as it involves migration of call stacks across nodes. Akiyama and Taura recently introduced efficient RDMA-based continuation stealing, but the practicality of distributed continuation stealing is still unclear because a comparison of its performance with that of child stealing has not previously been performed. This paper presents the results of a comparative performance analysis of continuation stealing and child stealing on distributed memory. To clarify the full potential of continuation stealing, we first investigated various RDMA-based synchronization (task join) implementations, which had not previously been fully investigated. The results revealed that, when the task synchronization pattern was complicated, continuation stealing performed better than child stealing despite its relatively long steal latency due to stack migration. Notably, our runtime system achieved almost perfect scaling on 110,592 cores in an unbalanced tree search (UTS) benchmark. This scalability is comparable to or even better than that of state-of-the-art bag-of-tasks counterparts.
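The child-versus-continuation distinction described in the abstract can be illustrated with a toy single-worker simulation (a hypothetical sketch for intuition only, not the paper's runtime): on `spawn`, child stealing pushes the new child onto the worker's deque and keeps running the caller, while continuation stealing pushes the caller's continuation and runs the child immediately, which preserves the serial execution order.

```python
from collections import deque

def run(policy):
    """Execute a tiny fork-join program under the given stealing policy.

    The program: main spawns task A, then spawns task B.
    Serial execution order (no parallelism) would be: A, B.
    """
    order = []
    dq = deque()  # worker's local deque; a thief would take from the other end

    def task(name):
        return lambda: order.append(name)

    def spawn(child, continuation):
        if policy == "child":
            # Child stealing: the new child is made stealable;
            # the worker keeps executing the caller's continuation.
            dq.append(child)
            continuation()
        else:
            # Continuation stealing: the continuation is made stealable;
            # the worker executes the child first (serial order preserved).
            dq.append(continuation)
            child()

    def main():
        spawn(task("A"), lambda: spawn(task("B"), lambda: None))

    main()
    # No thief showed up, so the worker drains its own deque (LIFO).
    while dq:
        dq.pop()()
    return order

print(run("continuation"))  # ['A', 'B'] -- matches serial order
print(run("child"))         # ['B', 'A'] -- children were deferred
```

With no thief present, the continuation-stealing run reproduces the serial order A, B, whereas the child-stealing run defers both children onto the deque and drains them in reverse. The same asymmetry is why distributed continuation stealing must migrate the caller's call stack when a steal does occur, which is the overhead the paper measures.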
Pages: 129-141
Page count: 13
References
68 in total
[1] Acar U. A. SPAA 2000: Twelfth Annual ACM Symposium on Parallel Algorithms and Architectures, 2000, p. 1. DOI: 10.1145/341800.341801
[2] Akiyama Shigeki. Journal of Information Processing, 2016, 24: 583
[3] Akiyama Shigeki. P 24 INT S HIGH PERF, 2015, p. 15
[4] [Anonymous]. P ACM 2000 C JAVA GR
[5] Antoniu G. Parallel and Distributed Processing: 11th IPPS/SPDP'99 Workshops Held in Conjunction with the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing, Proceedings, 1999, p. 496
[6] Blelloch G. E., Gibbons P. B., Matias Y. Provably efficient scheduling for languages with fine-grained parallelism. Journal of the ACM, 1999, 46(2): 281-321
[7] Blelloch G. E., Reid-Miller M. Pipelining with futures. Theory of Computing Systems, 1999, 32(3): 213-239
[8] Blumofe R. D., Leiserson C. E. Scheduling multithreaded computations by work stealing. Journal of the ACM, 1999, 46(5): 720-748
[9] Blumofe R. D., Joerg C. F., Kuszmaul B. C., Leiserson C. E., Randall K. H., Zhou Y. L. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 1996, 37(1): 55-69
[10] Brent R. P. Parallel evaluation of general arithmetic expressions. Journal of the ACM, 1974, 21(2): 201-206