An Automatic Parallel-Stage Decoupled Software Pipelining Parallelization Algorithm Based on OpenMP

Cited: 0
Authors
Liu, Xiaoxian [1 ]
Zhao, Rongcai [1 ]
Han, Lin [1 ]
Liu, Peng [1 ]
Affiliations
[1] State Key Lab Math Engn & Adv Comp, Zhengzhou, Peoples R China
Source
2013 12TH IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS (TRUSTCOM 2013) | 2013
Keywords
automatic parallelization; OpenMP; parallel-stage decoupled software pipelining;
DOI
10.1109/TrustCom.2013.227
CLC number
TP3 [Computing Technology & Computer Technology];
Subject classification code
0812;
Abstract
While multicore processors increase throughput for multiprogrammed and multithreaded codes, many important applications are single-threaded and thus do not benefit. Automatic parallelization techniques play an important role in migrating single-threaded applications to multicore platforms. Unfortunately, the prevalence of control flow, recursive data structures, and general pointer accesses in ordinary programs renders traditional automatic parallelization techniques unsuitable. Parallel-Stage Decoupled Software Pipelining (PS-DSWP) was proposed to exploit the fine-grained pipeline parallelism lurking in ordinary programs in the presence of all kinds of dependences, including arbitrary control dependences, at the instruction level. However, it requires knowledge of architectural properties as well as hardware support in the form of a communication channel and two special instructions. In this paper we propose an improved PS-DSWP algorithm based on OpenMP. It is implemented on a high-level intermediate representation and therefore does not depend on any particular CPU architecture. Moreover, the Program Dependence Graph (PDG) used in the algorithm is built on basic blocks, which exploits coarser-grained parallelism than the original PS-DSWP transformation, whose PDG is built on instructions. OpenMP is employed to assign tasks and implement synchronization among threads while avoiding dependence on hardware support. We evaluate loops with complex memory access patterns and control flow, which cannot be handled by traditional techniques, on a multicore platform. With our algorithm they can be parallelized and gain significant performance improvements: we obtain a maximum speedup of 2.07x and an average of 1.39x with 5 threads.
Pages: 1825-1831
Number of pages: 7
References
16 in total
[1]  
[Anonymous], 2001, OPTIMIZING COMPILERS
[2]  
Bondhugula U, 2008, LECT NOTES COMPUT SC, V4959, P132
[3]  
Campanoni S., 2012, P 10 INT S COD GEN O, P84, DOI 10.1145/2259016.2259028
[4]  
Canedo A, 2010, INT SYM CODE GENER, P151
[5]  
Cytron R., 1986, Proceedings of the 1986 International Conference on Parallel Processing (Cat. No.86CH2355-6), P836
[6]  
Davies J. R. B., 1981, THESIS U ILLINOIS UR
[7]
Ferrante J, Ottenstein KJ, Warren JD. The Program Dependence Graph and Its Use in Optimization [J]. ACM Transactions on Programming Languages and Systems, 1987, 9(3): 319-349
[8]
Fu Hong-Yi, Ding Yan, Song Wei, Yang Xue-Jun. Fault tolerance scheme using parallel recomputing for OpenMP programs [J]. Ruan Jian Xue Bao/Journal of Software, 2012, 23(2): 411-427
[9]  
Lee S, 2011, PROCEED CGO, P130, DOI 10.1109/CGO.2011.5764681
[10]  
Ottoni G, 2005, INT SYMP MICROARCH, P105