An Efficient Programming Skeleton for Clusters of Multi-Core Processors

Cited by: 0
Authors
Mina Hosseini Rad
Ahmad Patooghy
Mahdi Fazeli
Affiliations
[1] Iran University of Science and Technology, Department of Computer Engineering
[2] Institute for Research in Fundamental Sciences (IPM), School of Computer Science
Source
International Journal of Parallel Programming | 2018 / Vol. 46
Keywords
Cluster computing; Divide and conquer; Multi-core processor; Parallel programming; Skeleton;
DOI
Not available
Abstract
This paper proposes a divide and conquer skeleton that aids parallel system programmers by (1) reducing programming complexity, (2) shortening programming time, and (3) enhancing code efficiency. To do this, the proposed skeleton exploits three mechanisms: (1) work-stealing, (2) communication/computation overlapping, and (3) architectural awareness. With the work-stealing mechanism, when a processing core reaches a low-load condition, it fetches a new job from the waiting queue of another core. The second mechanism uses special threads to enable the skeleton to overlap computations with communications. The third mechanism considers the architectural parameters of the host system (e.g., size of the L1 cache, network bandwidth, and network latency) to match a divide and conquer problem to the skeleton as closely as possible. To evaluate the proposed skeleton, three benchmarks (merge sort, fast Fourier transform, and standard matrix multiplication) are developed both with the proposed skeleton and with customized programming. Experiments are done in both simulation and real implementation environments. The six resulting codes are simulated using the COTSon simulator and also implemented on a real system of 28 dual-core processors. Simulation results show an average speed-up of 12.6% for the proposed skeleton over customized programming; the speed-up obtained in the real environment is 9.6%. Furthermore, programming aided by the proposed skeleton is at least 70% faster than custom programming, and this difference grows as the program volume increases.
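To illustrate what a divide and conquer skeleton abstracts, the following is a minimal sequential sketch in Python. It is not the authors' implementation: the function name `divide_and_conquer`, its four callback parameters, and the `base_size` threshold are hypothetical names chosen for illustration. Work-stealing and communication/computation overlapping are omitted, and `base_size` only stands in for the kind of architecture-dependent tuning the abstract mentions (for example, choosing a base case that fits in the L1 cache).

```python
from typing import Callable, List, TypeVar

P = TypeVar("P")  # problem type
S = TypeVar("S")  # solution type


def divide_and_conquer(
    problem: P,
    is_base: Callable[[P], bool],
    solve_base: Callable[[P], S],
    divide: Callable[[P], List[P]],
    combine: Callable[[List[S]], S],
) -> S:
    """Generic skeleton: the user supplies only the four problem-specific parts."""
    if is_base(problem):
        return solve_base(problem)
    # A parallel skeleton would hand these subproblems to other cores;
    # this sketch solves them one after another.
    subsolutions = [
        divide_and_conquer(sub, is_base, solve_base, divide, combine)
        for sub in divide(problem)
    ]
    return combine(subsolutions)


def merge(parts: List[List[int]]) -> List[int]:
    """Merge two sorted lists into one sorted list."""
    left, right = parts
    out: List[int] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def merge_sort(data: List[int], base_size: int = 4) -> List[int]:
    """Merge sort expressed through the skeleton.

    base_size (hypothetical tuning knob, must be >= 1) is the size below
    which a chunk is sorted directly instead of being divided further.
    """
    return divide_and_conquer(
        list(data),
        is_base=lambda xs: len(xs) <= base_size,
        solve_base=sorted,
        divide=lambda xs: [xs[: len(xs) // 2], xs[len(xs) // 2:]],
        combine=merge,
    )
```

With this structure, expressing another benchmark such as matrix multiplication would only require a different set of four callbacks, which is the kind of reduction in programming effort the abstract attributes to skeleton-based programming.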
Pages: 1094-1109
Page count: 15