A data-localization compilation scheme using partial-static task assignment for Fortran coarse-grain parallel processing

Cited by: 7
Authors
Kasahara, H
Yoshida, A
Affiliations
[1] Waseda Univ, Dept Elect Elect & Comp Engn, Shinjuku Ku, Tokyo 1698555, Japan
[2] Toho Univ, Dept Informat Sci, Chiba 2748510, Japan
Keywords
parallelizing compilers; data localization; automatic data distribution; dynamic scheduling; coarse-grain parallel processing
DOI
10.1016/S0167-8191(98)00026-X
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
This paper proposes a compilation scheme for data localization using partial-static task assignment for Fortran coarse-grain parallel processing, or macro-dataflow processing, on a multiprocessor system with local memories and centralized shared memory. Data localization makes effective use of local memories and reduces data transfer overhead in a dynamic task-scheduling environment. The proposed compilation scheme consists mainly of three parts: (1) loop-aligned decomposition, which decomposes each of a set of loops with data dependences among them into smaller loops, and groups the decomposed loops into data-localizable groups so that data shared among the decomposed loops inside each group can be passed via local memory and data transfer overhead among the groups is minimized; (2) partial static task assignment, which informs the dynamic scheduling routine generator in the macro-dataflow compiler that the decomposed loops inside each data-localizable group must be assigned to the same processor; (3) parallel machine code generation, which generates parallel machine code that passes shared data inside a group through local memory and transfers data among groups through centralized shared memory. This compilation scheme has been implemented for a multiprocessor system, OSCAR (Optimally SCheduled Advanced multiprocessoR), which has centralized shared memory and distributed shared memory in addition to local memory on each processor. Performance evaluation on OSCAR shows that macro-dataflow processing with the proposed data-localization scheme reduces execution time by 20% on average compared with macro-dataflow processing without data localization. © 1998 Elsevier Science B.V. All rights reserved.
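The first two parts of the scheme can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the authors' compiler or the OSCAR runtime: it uses two hypothetical element-wise loops (`b(i) = a(i) * 2`, then `c(i) = b(i) + 1`), contiguous equal-size chunks, a least-loaded rule standing in for the dynamic scheduler, and per-processor dictionaries standing in for local memory. All function names are invented for illustration.

```python
def decompose(n, parts):
    """Split the iteration range [0, n) into `parts` contiguous chunks."""
    base, extra = divmod(n, parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def make_tasks(n, parts):
    """Decompose two dependent loops with aligned chunk boundaries.

    Chunk g of loop 1 and chunk g of loop 2 form group g (a stand-in for
    a data-localizable group). Each task is (loop_id, group_id, indices).
    """
    chunks = decompose(n, parts)
    return [(loop_id, g, r) for loop_id in (1, 2) for g, r in enumerate(chunks)]


def assign(tasks, num_procs):
    """Partial static assignment (toy version).

    The first task of a group is placed dynamically on the least-loaded
    processor; every later task of that group is pinned to the same one.
    """
    pinned, load, placement = {}, [0] * num_procs, {}
    for loop_id, g, r in tasks:
        if g not in pinned:
            pinned[g] = min(range(num_procs), key=lambda p: load[p])
        proc = pinned[g]
        load[proc] += len(r)
        placement[(loop_id, g)] = proc
    return placement


def execute(a, tasks, placement, num_procs):
    """Run the tasks; b stays in per-processor local memory, c is shared."""
    local = [dict() for _ in range(num_procs)]  # assumed local memories
    c = [None] * len(a)                         # assumed shared memory
    for loop_id, g, r in tasks:
        mem = local[placement[(loop_id, g)]]
        for i in r:
            if loop_id == 1:
                mem[i] = a[i] * 2   # loop 1: b(i) = a(i) * 2, kept local
            else:
                c[i] = mem[i] + 1   # loop 2: c(i) = b(i) + 1, reads local b
    return c
```

Because both tasks of a group land on the same processor, loop 2 finds every `b(i)` it needs in that processor's local dictionary, which is the effect the abstract attributes to data localization. Real loops with dependences across chunk boundaries would need additional handling that this sketch omits.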
Pages: 579-596
Number of pages: 18