Distributed Continuation Stealing is More Scalable than You Might Think

Cited by: 2
Authors
Shiina, Shumpei [1 ]
Taura, Kenjiro [1 ]
Affiliations
[1] Univ Tokyo, Tokyo, Japan
Source
2022 IEEE International Conference on Cluster Computing (CLUSTER 2022) | 2022
Keywords
work stealing; fork-join parallelism; futures; work-first scheduling; task-based runtime systems; RDMA; CILK;
DOI
10.1109/CLUSTER51413.2022.00027
CLC number
TP3 [Computing Technology, Computer Technology];
Discipline code
0812;
Abstract
The need for load balancing in applications with irregular parallelism has motivated research on work stealing. An important choice in work-stealing schedulers is between child stealing and continuation stealing. In child stealing, a newly created task is made stealable by other processors, whereas in continuation stealing, the caller's continuation is made stealable by executing the newly created task first, which preserves the serial execution order. Although the benefits of continuation stealing have been demonstrated on shared memory by Cilk and other runtime systems, it is rarely employed on distributed memory, presumably because it has been thought to be difficult to implement and inefficient as it involves migration of call stacks across nodes. Akiyama and Taura recently introduced efficient RDMA-based continuation stealing, but the practicality of distributed continuation stealing is still unclear because a comparison of its performance with that of child stealing has not previously been performed. This paper presents the results of a comparative performance analysis of continuation stealing and child stealing on distributed memory. To clarify the full potential of continuation stealing, we first investigated various RDMA-based synchronization (task join) implementations, which had not previously been fully investigated. The results revealed that, when the task synchronization pattern was complicated, continuation stealing performed better than child stealing despite its relatively long steal latency due to stack migration. Notably, our runtime system achieved almost perfect scaling on 110,592 cores in an unbalanced tree search (UTS) benchmark. This scalability is comparable to or even better than that of state-of-the-art bag-of-tasks counterparts.
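The abstract contrasts child stealing with continuation stealing at the spawn point of a fork-join task. As a rough aid to that distinction, the C++ sketch below (illustrative only, not code from the paper or its runtime system) annotates a recursive fib with the point at which the two policies diverge; the spawn/sync markers in the comments are conceptual placeholders for a fork-join runtime's primitives, not real keywords.

```cpp
// Minimal illustrative sketch of fork-join fib(), annotated with where
// child stealing and continuation stealing differ. "spawn"/"sync" are
// conceptual placeholders, not actual language keywords.
#include <cstdint>
#include <cstdio>

int64_t fib(int64_t n) {
    if (n < 2) return n;

    // spawn fib(n - 1):
    //  * Child stealing: the spawning worker packages the child task
    //    fib(n - 1), pushes it onto its deque, and keeps running the caller
    //    (i.e., proceeds to fib(n - 2)); an idle worker may steal the child.
    //  * Continuation stealing: the spawning worker executes fib(n - 1)
    //    immediately, preserving the serial execution order; what becomes
    //    stealable is the caller's continuation (the fib(n - 2) call and the
    //    final addition). On distributed memory this requires migrating part
    //    of the call stack to the thief, hence the longer steal latency
    //    mentioned in the abstract.
    int64_t x = fib(n - 1);  // conceptually: spawn
    int64_t y = fib(n - 2);  // the caller's continuation

    // sync: the join point whose RDMA-based implementations the paper studies.
    return x + y;
}

int main() {
    std::printf("fib(30) = %lld\n", static_cast<long long>(fib(30)));
    return 0;
}
```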
Pages: 129-141
Page count: 13