Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs

Cited by: 3
Authors
van den Braak, Gert-Jan [1]
Gomez-Luna, Juan [2]
Maria Gonzalez-Linares, Jose [3]
Corporaal, Henk [1]
Guil, Nicolas [3]
Affiliations
[1] Eindhoven Univ Technol, NL-5600 MB Eindhoven, Netherlands
[2] Univ Cordoba, E-14071 Cordoba, Spain
[3] Univ Malaga, Dept Comp Architecture, E-29071 Malaga, Spain
Keywords
Computer architecture; graphics processing units; memory architecture
DOI
10.1109/TC.2015.2479595
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Scratchpad memories in GPU architectures are employed as software-controlled caches to increase the effective GPU memory bandwidth. They are exploited through well-known optimization techniques such as privatization and tiling. Typically, they are banked memories addressed with a mod 2^N bank indexing scheme. Although their bandwidth is fully exploited for linear memory accesses, performance degrades when non-unit strides appear in memory access patterns, because these strides cause bank conflicts. This paper explores the use of configurable bit-vector and bitwise XOR-based hash functions to distribute the memory addresses of the access patterns evenly over the memory banks, reducing the number of bank conflicts. An exhaustive, but lightweight, search is used to configure bit-vector hash functions. Bitwise hash functions are configured with heuristics. Both hardware and software implementations are evaluated. For the hardware approach, the experimental results show a 24 percent performance speed-up for 22 benchmarks on GPGPU-Sim, a Fermi-like simulator. Bank conflicts are reduced by 96 percent with bit-vector hash functions, and by 97 percent with bitwise hash functions using our proposed Minimum Imbalance Heuristic. The software approach, using bit-vector hash functions, demonstrates a 23 percent speed-up and a 96 percent bank conflict reduction on a Fermi GPU, and a 33 percent speed-up and a 99 percent bank conflict reduction on a Kepler GPU.
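The effect of strided accesses on mod 2^N indexing can be illustrated with a minimal Python sketch. It assumes 32 banks and 32 threads per warp, and it uses one fixed, hypothetical XOR hash (folding the next five address bits onto the bank bits). This is not one of the configured hash functions from the paper, and the names `bank_mod`, `bank_xor`, and `max_conflict_degree` are chosen here for illustration only.

```python
NUM_BANKS = 32   # assumed: 2**5 banks
BANK_BITS = 5    # log2(NUM_BANKS)


def bank_mod(addr):
    """Conventional mod 2^N indexing: the low BANK_BITS bits select the bank."""
    return addr % NUM_BANKS


def bank_xor(addr):
    """Illustrative fixed XOR hash: XOR the next BANK_BITS bits onto the low bits."""
    return (addr ^ (addr >> BANK_BITS)) % NUM_BANKS


def max_conflict_degree(bank_fn, stride, num_threads=32):
    """Largest number of threads in one warp that map to the same bank."""
    counts = [0] * NUM_BANKS
    for t in range(num_threads):
        counts[bank_fn(t * stride)] += 1
    return max(counts)


for stride in (1, 2, 32, 33):
    print(stride,
          max_conflict_degree(bank_mod, stride),
          max_conflict_degree(bank_xor, stride))
```

With a stride of 32 words, every thread maps to bank 0 under mod 2^N indexing, while this XOR hash spreads the same accesses over all 32 banks. A stride of 33 shows the opposite case: it is conflict-free under mod 2^N indexing but collapses onto a single bank under this particular fixed hash, which suggests why a hash function that can be configured per access pattern is attractive.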
Pages: 2045-2058
Number of pages: 14