Towards Highly Efficient DGEMM on the Emerging SW26010 Many-core Processor

Cited by: 32
Authors
Jiang, Lijuan [1, 2]
Yang, Chao [1, 3]
Ao, Yulong [1, 2]
Yin, Wanwang [4]
Ma, Wenjing [1, 3]
Sun, Qiao [1]
Liu, Fangfang [1, 2]
Lin, Rongfen [4]
Zhang, Peng [1, 2]
Affiliations
[1] Chinese Acad Sci, Inst Software, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Chinese Acad Sci, State Key Lab Comp Sci, Beijing, Peoples R China
[4] Natl Res Ctr Parallel Comp Engn & Technol, Beijing, Peoples R China
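Source
2017 46TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP) | 2017
Funding
National Natural Science Foundation of China;
Keywords
DGEMM; dense linear algebra; SW26010; processor; many-core architecture; Sunway TaihuLight; LINPACK BENCHMARK;
DOI
10.1109/ICPP.2017.51
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract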
The matrix-matrix multiplication is an essential building block that can be found in various scientific and engineering applications. High-performance implementations of the matrix-matrix multiplication on state-of-the-art processors may be of great importance for both the vendors and the users. In this paper, we present a detailed methodology of implementing and optimizing the double-precision general format matrix-matrix multiplication (DGEMM) kernel on the emerging SW26010 processor, which is used to build the Sunway TaihuLight supercomputer. We propose a three-level blocking algorithm to orchestrate data on the memory hierarchy and expose parallelism on different hardware levels, and design a collective data sharing scheme by using the register communication mechanism to exchange data efficiently among different cores. On top of those, further optimizations are done based on a data-thread mapping method for efficient data distribution, a double buffering scheme for asynchronous DMA data transfer, and an instruction scheduling method for maximizing the pipeline usage. Experiment results show that the proposed DGEMM implementation can fully exploit the unique hardware features provided by SW26010 and can sustain up to 95% of the peak performance.
Pages: 422-431
Number of pages: 10
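
The blocking idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' three-level SW26010 algorithm: it shows only a single level of blocking in plain Python, and the function name `blocked_matmul` and the `block` parameter are hypothetical names chosen for illustration. It computes the standard DGEMM update alpha*A*B + beta*C tile by tile, so that each tile of the operands is reused while it is "hot".

```python
def blocked_matmul(A, B, C, alpha=1.0, beta=1.0, block=2):
    """Return alpha*A*B + beta*C, computed tile by tile.

    Illustrative single-level blocking only; `block` is a hypothetical
    tile size, not a value taken from the paper.
    """
    m, k, n = len(A), len(B), len(B[0])
    # Start from the scaled C, as in the DGEMM definition.
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    # Outer loops walk over tiles of the output and of the shared dimension.
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for p0 in range(0, k, block):
                # Inner loops multiply one tile of A by one tile of B.
                for i in range(i0, min(i0 + block, m)):
                    for p in range(p0, min(p0 + block, k)):
                        a = alpha * A[i][p]
                        for j in range(j0, min(j0 + block, n)):
                            out[i][j] += a * B[p][j]
    return out


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    C = [[0, 0], [0, 0]]
    print(blocked_matmul(A, B, C))
```

The result does not depend on the tile size; only the order in which partial products are accumulated changes. A multi-level scheme of the kind the abstract describes would nest this pattern once per level of the memory hierarchy.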