FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs

Cited by: 0

Authors:

Zhai, Yujia [1]

Giem, Elisabeth [1]

Zhao, Kai [2]

Liu, Jinyang [1]

Huang, Jiajun [1]

Wong, Bryan M. [1]

Shelton, Christian R. [1]

Chen, Zizhong [1]

Affiliations:

[1] Univ Calif Riverside, Riverside, CA 92521 USA

[2] Univ Alabama Birmingham, Birmingham, AL 35294 USA

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2023 / Vol. 34 / No. 12

Keywords:

BLAS; SIMD; assembly optimization; dual modular redundancy; algorithm-based fault tolerance; AVX-512; AVX2; OpenMP; parallel algorithm; ERROR-DETECTION; ALGORITHM; SOFTWARE; RESILIENCE; SCHEME

DOI:

10.1109/TPDS.2023.3316011

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

Basic Linear Algebra Subprograms (BLAS) serve as a foundational library for scientific computing and machine learning. In this article, we present a new BLAS implementation, FT-BLAS, that delivers performance comparable to or better than state-of-the-art BLAS libraries while tolerating soft errors on the fly. At the algorithmic level, we propose a hybrid strategy to incorporate fault-tolerant functionality. For memory-bound Level-1 and Level-2 BLAS routines, we duplicate computing instructions and reuse data at the register level to avoid memory overhead when validating runtime correctness. We propose using mask registers on AVX512-enabled processors and SIMD registers on AVX2-enabled processors to store intermediate comparison results. For compute-bound Level-3 BLAS routines, we fuse memory-intensive operations such as checksum encoding and verification into the GEMM assembly kernels to optimize the memory footprint. We also design cache-friendly parallel algorithms for our fault-tolerant library. Through a series of architecture-aware optimizations, we keep the fault-tolerance overhead negligible (<3%). Experimental results obtained on widely used processors such as Intel Skylake, Intel Cascade Lake, and AMD Zen2 demonstrate that FT-BLAS offers high reliability and high performance: it is faster than Intel MKL, OpenBLAS, and BLIS by up to 3.50%, 22.14%, and 21.70%, respectively, for both serial and parallel routines spanning all three levels of BLAS we benchmarked, even under hundreds of errors injected per minute.
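The checksum idea behind algorithm-based fault tolerance for matrix multiplication, as mentioned in the abstract, can be illustrated with a minimal Python sketch. This is not FT-BLAS code: the function names (`matmul`, `abft_check_and_correct`), the tolerance parameter, and the single-error correction policy are assumptions made for illustration. The abstract states that checksum encoding and verification are fused into the GEMM assembly kernels, whereas this sketch runs them as a separate pass after the product is formed. It relies only on the identities that the row sums of C = AB equal A(Be) and the column sums equal (e^T A)B, where e is the all-ones vector.

```python
# Illustrative sketch only; all names here are hypothetical, not FT-BLAS APIs.

def matmul(a, b):
    """Plain matrix product C = A * B on lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def abft_check_and_correct(a, b, c, tol=1e-9):
    """Verify C against checksums derived from A and B.

    Returns "ok", "corrected" (a single corrupted element was repaired
    in place), or "uncorrectable".
    """
    n, k, m = len(a), len(b), len(b[0])

    # Checksums computed from the inputs only.
    b_row_sum = [sum(b[p]) for p in range(k)]                        # B e
    a_col_sum = [sum(a[i][p] for i in range(n)) for p in range(k)]   # e^T A
    ref_row = [sum(a[i][p] * b_row_sum[p] for p in range(k))
               for i in range(n)]                                    # A (B e)
    ref_col = [sum(a_col_sum[p] * b[p][j] for p in range(k))
               for j in range(m)]                                    # (e^T A) B

    # Differences between the checksums of C and the reference checksums.
    row_diff = [sum(c[i]) - ref_row[i] for i in range(n)]
    col_diff = [sum(c[i][j] for i in range(n)) - ref_col[j]
                for j in range(m)]

    bad_rows = [i for i, d in enumerate(row_diff) if abs(d) > tol]
    bad_cols = [j for j, d in enumerate(col_diff) if abs(d) > tol]

    if not bad_rows and not bad_cols:
        return "ok"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # One faulty row and one faulty column locate the corrupted element;
        # the row difference is the size of the error.
        i, j = bad_rows[0], bad_cols[0]
        c[i][j] -= row_diff[i]
        return "corrected"
    return "uncorrectable"
```

A short usage example: compute `c = matmul(a, b)`, corrupt one entry of `c` to mimic a soft error, and call `abft_check_and_correct(a, b, c)`. With floating-point inputs the tolerance would need to scale with the magnitude of the data, which this sketch does not attempt.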


Pages: 3207-3223

Page count: 17

26 records in total

[1] FT-BLAS: A High Performance BLAS Implementation With Online Fault Tolerance
Zhai, Yujia
Giem, Elisabeth
Fan, Quan
Zhao, Kai
Liu, Jinyang
Chen, Zizhong
PROCEEDINGS OF THE 2021 ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, ICS 2021, 2021, : 127 - 138
[2] FT-GEMM: A Fault Tolerant High Performance GEMM Implementation on x86 CPUs
Wu, Shixun
Zhai, Yujia
Huang, Jiajun
Jian, Zizhe
Chen, Zizhong
PROCEEDINGS OF THE 32ND INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING, HPDC 2023, 2023, : 323 - 324
[3] Time and energy modeling of high-performance Level-3 BLAS on x86 architectures
Alonso, Pedro
Catalan, Sandra
Igual, Francisco D.
Mayo, Rafael
Rodriguez-Sanchez, Rafael
Quintana-Orti, Enrique S.
SIMULATION MODELLING PRACTICE AND THEORY, 2015, 55 : 77 - 94
[4] A high-performance implementation of atomistic spin dynamics simulations on x86 CPUs
Chen, Hongwei
Zhai, Yujia
Turner, Joshua J.
Feiguin, Adrian
COMPUTER PHYSICS COMMUNICATIONS, 2023, 291
[5] SSE Implementation of Multivariate PKCs on Modern x86 CPUs
Chen, Anna Inn-Tung
Chen, Ming-Shing
Chen, Tien-Ren
Cheng, Chen-Mou
Ding, Jintai
Kuo, Eric Li-Hsiang
Lee, Frost Yu-Shuang
Yang, Bo-Yin
CRYPTOGRAPHIC HARDWARE AND EMBEDDED SYSTEMS - CHES 2009, PROCEEDINGS, 2009, 5747 : 33 - +
[6] Design and implementation of high performance BLAS for Pentium Pro
Li, Zhongze
Chen, Jin
Long, Xiang
Li, Wei
Ruan Jian Xue Bao/Journal of Software, 1998, 9 (05): : 454 - 457
[7] Automatic Generation of High-Performance FFT Kernels on Arm and X86 CPUs
Li, Zhihao
Jia, Haipeng
Zhang, Yunquan
Chen, Tun
Yuan, Liang
Vuduc, Richard
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2020, 31 (08) : 1925 - 1941
[8] High-performance implementation of the level-3 BLAS
Goto, Kazushige
Van De Geijn, Robert
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2008, 35 (01): : 1 - 14
[9] FitenBLAS: High performance BLAS for a massively multithreaded FT1000 processor
Chi, Li-Hua
Liu, Jie
Yan, Yi-Hui
Xie, Lin-Chuan
Gan, Xin-Biao
Hu, Qin-Feng
Jiang, Jie
Li, Sheng-Guo
Hunan Daxue Xuebao/Journal of Hunan University Natural Sciences, 2015, 42 (04): : 100 - 106
[10] AUGEM: Automatically Generate High Performance Dense Linear Algebra Kernels on x86 CPUs
Wang, Qian
Zhang, Xianyi
Zhang, Yunquan
Yi, Qing
2013 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC), 2013
