Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor

Cited by: 83
Authors
Zhang Xianyi [1]
Wang Qian [1]
Zhang Yunquan [1]
Affiliation
[1] Chinese Acad Sci, Inst Software, Lab Parallel Software & Computat Sci, Beijing 100190, Peoples R China
Source
PROCEEDINGS OF THE 2012 IEEE 18TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS 2012) | 2012
Keywords
BLAS; Loongson; 3A; MIPS64; Optimization; Multi-core;
DOI
10.1109/ICPADS.2012.97
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Every mainstream processor vendor provides an optimized BLAS implementation for its CPUs, because BLAS is a fundamental math library in scientific computing. The Loongson 3A is a general-purpose quad-core 64-bit MIPS64 processor developed by the Institute of Computing Technology, Chinese Academy of Sciences. To date, no sufficiently optimized BLAS has been available for the Loongson 3A. This research aims to optimize level 3 BLAS performance on this CPU. We analyzed the Loongson 3A architecture and built a performance model that identifies L1 data cache misses as the key issue, in contrast to level 3 BLAS optimization on mainstream x86 CPUs. We therefore employed several methods to avoid L1 cache misses in single-thread optimization: cache and register blocking, the Loongson 3A 128-bit memory access extension instructions, software prefetching, and single-precision floating-point SIMD instructions. Furthermore, we improved parallel performance by reducing bank conflicts among multiple threads in the shared L2 cache. We created an open source BLAS project, OpenBLAS, to demonstrate the performance improvement on the Loongson 3A quad-core processor.
Pages: 684-691
Page count: 8
Related papers
13 in total
[1]  
[Anonymous], 2012, COMPUTER ARCHITECTUR
[2]   An updated set of Basic Linear Algebra Subprograms (BLAS) [J].
Blackford, LS ;
Demmel, J ;
Dongarra, J ;
Duff, I ;
Hammarling, S ;
Henry, G ;
Heroux, M ;
Kaufman, L ;
Lumsdaine, A ;
Petitet, A ;
Pozo, R ;
Remington, K ;
Whaley, RC .
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2002, 28 (02) :135-151
[3]  
Cui HM, 2011, INT SYM CODE GENER, P107, DOI 10.1109/CGO.2011.5764679
[4]   The LINPACK benchmark: past, present and future [J].
Dongarra, JJ ;
Luszczek, P ;
Petitet, A .
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2003, 15 (09) :803-820
[5]   Anatomy of high-performance matrix multiplication [J].
Goto, Kazushige ;
Van De Geijn, Robert A. .
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2008, 34 (03)
[6]   High-performance implementation of the level-3 BLAS [J].
Goto, Kazushige ;
Van De Geijn, Robert .
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2008, 35 (01) :1-14
[7]  
Gu Nai-Jie, 2008, Journal of University of Science and Technology of China, V38, P854
[8]  
He Song-song, 2012, Journal of Chinese Computer Systems, V33, P571
[9]  
Li Yi, 2011, Computer Systems & Applications, V20, P163
[10]  
Loongson Technology Corp. Ltd, 2009, LOONGS 3A PROC MAN