Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer

Authors
Feng Wang
Can-Qun Yang
Yun-Fei Du
Juan Chen
Hui-Zhan Yi
Wei-Xia Xu
Affiliations
[1] National University of Defense Technology, School of Computer Science
Source
Journal of Computer Science and Technology | 2011 / Vol. 26
Keywords
petascale; Linpack; GPU; heterogeneous; supercomputer
DOI
Not available
Abstract
In this paper we present the programming of the Linpack benchmark on the TianHe-1 system, the first petascale supercomputer system of China and the largest GPU-accelerated heterogeneous system attempted to date. We describe a hybrid programming model consisting of MPI, OpenMP, and streaming computing, which exploits the task, thread, and data parallelism of Linpack. We explain how we optimized the load distribution across the CPUs and GPUs using a two-level adaptive method and describe the implementation in detail. To overcome the low bandwidth of CPU-GPU communication, we present a software pipelining technique that hides the communication overhead. Combined with other traditional optimizations, the Linpack we developed achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained with the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list in November 2009.
Pages: 854-865
Page count: 11