Well-structured futures and cache locality

Cited by: 0
Authors
Herlihy M. [1 ]
Liu Z. [1 ]
Affiliations
[1] Computer Science Department, Brown University, Providence, RI 02912
Source
Association for Computing Machinery, 2 Penn Plaza, Suite 701, New York, NY 10121-0701, United States; Issue 02
Funding
U.S. National Science Foundation
Keywords
Cache locality; Futures; Parallel programming; Performance bounds; Scheduling; Work stealing;
DOI
10.1145/2858650
Abstract
In fork-join parallelism, a sequential program is split into a directed acyclic graph of tasks linked by directed dependency edges, and the tasks are executed, possibly in parallel, in an order consistent with their dependencies. A popular and effective way to extend fork-join parallelism is to allow threads to create futures. A thread creates a future to hold the results of a computation, which may or may not be executed in parallel. That result is returned when some thread touches that future, blocking if necessary until the result is ready. Recent research has shown that although futures can, of course, enhance parallelism in a structured way, they can have a deleterious effect on cache locality. In the worst case, futures can incur Ω(PT∞ + tT∞) deviations, which implies Ω(CPT∞ + CtT∞) additional cache misses, where C is the number of cache lines, P is the number of processors, t is the number of touches, and T∞ is the computation span. Since cache locality has a large impact on software performance on modern multicores, this result is troubling. In this article, we show that if futures are used in a simple, disciplined way, then the situation is much better: if each future is touched only once, either by the thread that created it or by a later descendant of the thread that created it, then parallel executions with work stealing can incur at most O(CPT∞²) additional cache misses, a substantial improvement. This structured use of futures is characteristic of many (but not all) parallel applications. © 2016 ACM.
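The "structured" discipline in the abstract can be illustrated with a minimal sketch (not the authors' code, and Python's `concurrent.futures` rather than a work-stealing runtime): a thread creates a future, continues with its own share of the work, and then touches the future exactly once itself. The helper `partial_sum` and the data split are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(xs):
    # Illustrative task body: sum a slice of the input.
    return sum(xs)

data = list(range(1000))
with ThreadPoolExecutor(max_workers=2) as pool:
    fut = pool.submit(partial_sum, data[:500])  # create a future (fork)
    right = partial_sum(data[500:])             # creating thread keeps working
    total = fut.result() + right                # single touch, by the creator (join)
print(total)  # 499500
```

Because the future is touched once, by the thread that created it, this pattern stays within the single-touch restriction under which the article proves the O(CPT∞²) cache-miss bound; a future touched repeatedly, or by an unrelated thread, would fall outside it.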