FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs

Authors
Zhai, Yujia [1]
Giem, Elisabeth [1]
Zhao, Kai [2]
Liu, Jinyang [1]
Huang, Jiajun [1]
Wong, Bryan M. [1]
Shelton, Christian R. [1]
Chen, Zizhong [1]
Affiliations
[1] Univ Calif Riverside, Riverside, CA 92521 USA
[2] Univ Alabama Birmingham, Birmingham, AL 35294 USA
Keywords
BLAS; SIMD; assembly optimization; dual modular redundancy; algorithm-based fault tolerance; AVX-512; AVX2; OpenMP; parallel algorithm; ERROR-DETECTION; ALGORITHM; SOFTWARE; RESILIENCE; SCHEME
DOI
10.1109/TPDS.2023.3316011
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Basic Linear Algebra Subprograms (BLAS) serve as a foundational library for scientific computing and machine learning. In this article, we present a new BLAS implementation, FT-BLAS, that provides performance comparable to or better than state-of-the-art BLAS libraries while tolerating soft errors on the fly. At the algorithmic level, we propose a hybrid strategy to incorporate fault-tolerant functionality. For memory-bound Level-1 and Level-2 BLAS routines, we duplicate computing instructions and reuse data at the register level to avoid memory overhead when validating runtime correctness. Here we propose a novel use of mask registers on AVX512-enabled processors and SIMD registers on AVX2-enabled processors to store intermediate comparison results. For compute-bound Level-3 BLAS routines, we fuse memory-intensive operations such as checksum encoding and verification into the GEMM assembly kernels to optimize the memory footprint. We also design cache-friendly parallel algorithms for our fault-tolerant library. Through a series of architecture-aware optimizations, we keep the fault-tolerance overhead negligible (<3%). Experimental results obtained on widely used processors such as Intel Skylake, Intel Cascade Lake, and AMD Zen2 demonstrate that FT-BLAS offers both high reliability and high performance: it is faster than Intel MKL, OpenBLAS, and BLIS by up to 3.50%, 22.14%, and 21.70%, respectively, for both serial and parallel routines spanning all three levels of BLAS we benchmarked, even under hundreds of errors injected per minute.
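The two strategies named in the abstract can be illustrated with a minimal sketch. This is plain Python, not the paper's assembly kernels, and every function name and parameter here (`dmr_scal`, `abft_gemm`, `fault_at`, `tol`) is hypothetical. The first function mimics dual modular redundancy for a Level-1 style scaling routine by computing each product twice and comparing. The second assumes a classical row/column checksum scheme for algorithm-based fault tolerance; the abstract does not spell out the exact checksum layout, and FT-BLAS fuses these steps into the GEMM kernel rather than running them as separate passes as done here.

```python
# Illustrative sketch only; names and the checksum layout are assumptions.

def dmr_scal(alpha, x, fault_at=None):
    """Scale x by alpha, computing every product twice (DMR)."""
    out, corrected = [], 0
    for i, xi in enumerate(x):
        first = alpha * xi
        if i == fault_at:
            first += 1.0          # simulated soft error in the first copy
        second = alpha * xi       # duplicated computation
        if first != second:       # mismatch detected: recompute the element
            first = alpha * xi
            corrected += 1
        out.append(first)
    return out, corrected


def abft_gemm(a, b, fault_at=None, tol=1e-9):
    """Compute C = A * B and verify it with row and column checksums."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
         for i in range(m)]
    if fault_at is not None:
        fi, fj = fault_at
        c[fi][fj] += 1.0          # simulated soft error in one element of C

    # Reference checksums derived from the inputs: (e^T A) B and A (B e).
    col_a = [sum(a[i][p] for i in range(m)) for p in range(k)]
    row_b = [sum(b[p][j] for j in range(n)) for p in range(k)]
    col_ref = [sum(col_a[p] * b[p][j] for p in range(k)) for j in range(n)]
    row_ref = [sum(a[i][p] * row_b[p] for p in range(k)) for i in range(m)]

    # Differences between the checksums of the computed C and the references.
    col_err = [sum(c[i][j] for i in range(m)) - col_ref[j] for j in range(n)]
    row_err = [sum(c[i][j] for j in range(n)) - row_ref[i] for i in range(m)]
    bad_rows = [i for i, e in enumerate(row_err) if abs(e) > tol]
    bad_cols = [j for j, e in enumerate(col_err) if abs(e) > tol]

    corrected = 0
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # A single corrupted element sits at the intersection of the bad
        # row and bad column; subtract the checksum difference to repair it.
        i, j = bad_rows[0], bad_cols[0]
        c[i][j] -= row_err[i]
        corrected = 1
    return c, corrected


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
scaled, scal_fixes = dmr_scal(2.0, [1.0, 2.0, 3.0], fault_at=1)
product, gemm_fixes = abft_gemm(a, b, fault_at=(0, 1))
```

In this sketch the duplicated computation costs extra arithmetic but no extra memory traffic, which is why it suits memory-bound routines, while the checksum approach adds only lower-order work on top of the matrix product, which is why it suits compute-bound GEMM. The fixed `tol` threshold is a simplification; a production check would need a tolerance scaled to floating-point rounding error.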
Pages: 3207 - 3223
Page count: 17
Related papers
26 in total
  • [1] FT-BLAS: A High Performance BLAS Implementation With Online Fault Tolerance
    Zhai, Yujia
    Giem, Elisabeth
    Fan, Quan
    Zhao, Kai
    Liu, Jinyang
    Chen, Zizhong
    PROCEEDINGS OF THE 2021 ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, ICS 2021, 2021 : 127 - 138
  • [2] FT-GEMM: A Fault Tolerant High Performance GEMM Implementation on x86 CPUs
    Wu, Shixun
    Zhai, Yujia
    Huang, Jiajun
    Jian, Zizhe
    Chen, Zizhong
    PROCEEDINGS OF THE 32ND INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING, HPDC 2023, 2023 : 323 - 324
  • [3] Time and energy modeling of high-performance Level-3 BLAS on x86 architectures
    Alonso, Pedro
    Catalan, Sandra
    Igual, Francisco D.
    Mayo, Rafael
    Rodriguez-Sanchez, Rafael
    Quintana-Orti, Enrique S.
    SIMULATION MODELLING PRACTICE AND THEORY, 2015, 55 : 77 - 94
  • [4] A high-performance implementation of atomistic spin dynamics simulations on x86 CPUs
    Chen, Hongwei
    Zhai, Yujia
    Turner, Joshua J.
    Feiguin, Adrian
    COMPUTER PHYSICS COMMUNICATIONS, 2023, 291
  • [5] SSE Implementation of Multivariate PKCs on Modern x86 CPUs
    Chen, Anna Inn-Tung
    Chen, Ming-Shing
    Chen, Tien-Ren
    Cheng, Chen-Mou
    Ding, Jintai
    Kuo, Eric Li-Hsiang
    Lee, Frost Yu-Shuang
    Yang, Bo-Yin
    CRYPTOGRAPHIC HARDWARE AND EMBEDDED SYSTEMS - CHES 2009, PROCEEDINGS, 2009, 5747 : 33 - +
  • [6] Design and implementation of high performance BLAS for Pentium Pro
    Li, Zhongze
    Chen, Jin
    Long, Xiang
    Li, Wei
    Ruan Jian Xue Bao/Journal of Software, 1998, 9 (05) : 454 - 457
  • [7] Automatic Generation of High-Performance FFT Kernels on Arm and X86 CPUs
    Li, Zhihao
    Jia, Haipeng
    Zhang, Yunquan
    Chen, Tun
    Yuan, Liang
    Vuduc, Richard
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2020, 31 (08) : 1925 - 1941
  • [8] High-performance implementation of the level-3 BLAS
    Goto, Kazushige
    Van De Geijn, Robert
    ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2008, 35 (01) : 1 - 14
  • [9] FitenBLAS: High performance BLAS for a massively multithreaded FT1000 processor
    Chi, Li-Hua
    Liu, Jie
    Yan, Yi-Hui
    Xie, Lin-Chuan
    Gan, Xin-Biao
    Hu, Qin-Feng
    Jiang, Jie
    Li, Sheng-Guo
    Hunan Daxue Xuebao/Journal of Hunan University Natural Sciences, 2015, 42 (04) : 100 - 106
  • [10] AUGEM: Automatically Generate High Performance Dense Linear Algebra Kernels on x86 CPUs
    Wang, Qian
    Zhang, Xianyi
    Zhang, Yunquan
    Yi, Qing
    2013 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC), 2013