Fine-Granular Computation and Data Layout Reorganization for Improving Locality

Cited by: 0
Authors
Kandemir, Mahmut [1 ]
Tang, Xulong [2 ]
Kotra, Jagadish [3 ]
Karakoy, Mustafa [4 ]
Affiliations
[1] Penn State Univ, State Coll, PA 16801 USA
[2] Univ Pittsburgh, Pittsburgh, PA USA
[3] AMD Res, Austin, TX USA
[4] TUBITAK BILGEM, Kocaeli, Turkey
Source
2022 IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER AIDED DESIGN, ICCAD, 2022
Keywords
Data locality; Data layout; Code optimization; CACHE PERFORMANCE; OPTIMIZATIONS; TRANSFORMATIONS; PARALLELISM; LOOP;
DOI
10.1145/3508352.3549386
CLC classification number
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
While data locality and cache performance have been investigated in great depth by prior research (in the context of both high-end systems and embedded/mobile systems), an important characteristic of prior approaches is that they transform a loop nest and/or data space (e.g., an array layout) as a whole. Unfortunately, such coarse-grain approaches raise three critical issues. First, they implicitly assume that all parts of a given array would benefit equally from the identified data layout transformation. Second, they assume that a given loop transformation has the same locality impact across an entire data array. Third, and most importantly, such coarse-grain approaches are local by nature, which makes it difficult to achieve globally optimal executions. Motivated by these drawbacks of existing code and data space reorganization/optimization techniques, this paper proposes to determine multiple loop transformation matrices for each loop nest in the program and multiple data layout transformations for each array accessed by the program, in an attempt to exploit data locality at a finer granularity. It leverages bipartite graph matching and extends the proposed fine-granular integrated loop-layout strategy to a multicore setting as well. Our experimental results show that the proposed approach significantly improves data locality and outperforms existing schemes: 9.1% average performance improvement in single-threaded executions and 11.5% average improvement in multi-threaded executions over the state-of-the-art.
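The record includes only the abstract, but the core mechanism it names (expressing loop restructuring as a transformation matrix applied to the iteration vector) can be illustrated with a minimal, hypothetical sketch. The matrix `T` and helpers below are illustrative assumptions showing a classic unimodular loop interchange, not the paper's actual algorithm:

```python
# Loop interchange as a unimodular transformation: a loop restructuring
# is conventionally expressed as a matrix T applied to the iteration
# vector (i, j). Here T swaps the two loops, so the transformed nest
# makes i vary fastest, which improves spatial locality when the inner
# access pattern was column-major over a row-major array layout.
# (Illustrative sketch only; not the paper's fine-granular scheme.)

T = [[0, 1],
     [1, 0]]  # permutation matrix encoding loop interchange

def apply(T, point):
    """Multiply the 2x2 transformation matrix T by an iteration vector."""
    i, j = point
    return (T[0][0] * i + T[0][1] * j,
            T[1][0] * i + T[1][1] * j)

def transformed_order(n, T):
    """Enumerate an n x n iteration space in the order induced by T."""
    points = [(i, j) for i in range(n) for j in range(n)]
    # The transformed nest visits points lexicographically by T @ (i, j).
    return sorted(points, key=lambda p: apply(T, p))

# The original nest visits (0,0), (0,1), (1,0), (1,1): j varies fastest.
# After interchange, i varies fastest instead.
print(transformed_order(2, T))  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

A coarse-grain optimizer would pick one such `T` per loop nest; the fine-granular approach described in the abstract instead selects multiple transformation matrices per nest and multiple layouts per array, matched via bipartite graph matching.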
Pages: 9