Soft Error Resilience at Near-Zero Cost

Cited by: 1
Authors
Zeng, Jianping [1]
Huang, Shao-Yu [1]
Liu, Jiuyang [2]
Jung, Changhee [1]
Affiliations
[1] Purdue Univ, W Lafayette, IN 47907 USA
[2] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
Source
PROCEEDINGS OF THE 38TH ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, ACM ICS 2024 | 2024
Keywords
soft error resilience; compiler; computer architecture; OPTIMIZATIONS;
DOI
10.1145/3650200.3656605
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Among existing schemes for soft error resilience, acoustic-sensor-based detection stands out owing to its ability to prevent silent data corruption at low hardware cost. However, the state-of-the-art work not only incurs considerable run-time overhead but also complicates the processor pipeline with intrusive micro-architectural modifications, hindering its practical deployment in real silicon. To this end, this paper presents VeriPipe, a near-zero-cost compiler/architecture codesign scheme for soft error resilience. The VeriPipe compiler statically partitions the input program into a series of regions (epochs), while the VeriPipe hardware dynamically verifies that they are error-free. In particular, VeriPipe achieves simple yet efficient region-level verification in which each region goes through a three-stage (Execute, Verify, and Commit) verification pipeline to ensure the absence of soft errors before proceeding to the next region. Moreover, the VeriPipe hardware overlaps the Verify stage of each region with the Execute stage of the next region, thereby effectively hiding the Verify delay. Experiments with 43 applications from SPEC2006/2017, NPB-CPP, SPLASH3, and DoE Mini-Apps highlight the negligible overheads of VeriPipe, i.e., an average run-time overhead of 1% and a storage overhead of only 3 registers and 1 countdown timer.
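The effect of overlapping the Verify stage of one region with the Execute stage of the next can be illustrated with a toy timing model. This is only a sketch and not the paper's hardware design: the function names, the cycle counts, and the rule that a region stalls at its boundary until the previous region's Verify window has elapsed are all assumptions made for illustration.

```python
# Toy timing model of region-level verification (illustrative assumptions only).
# exec_cycles: hypothetical Execute time of each region, in cycles.
# verify_delay: hypothetical fixed Verify window (e.g. a countdown timer), in cycles.

def serialized_time(exec_cycles, verify_delay):
    """Each region executes, then waits for its full Verify window before the next starts."""
    return sum(e + verify_delay for e in exec_cycles)


def overlapped_time(exec_cycles, verify_delay):
    """Verify of region i runs concurrently with Execute of region i+1.

    Assumption: region i+1 may not leave its Execute stage until the Verify
    window of region i has elapsed, so short regions stall at their boundary.
    """
    if not exec_cycles:
        return 0
    boundary = 0  # cycle at which the previous region finished Execute
    for i, e in enumerate(exec_cycles):
        finish = boundary + e
        if i > 0:
            # Stall until the previous region's Verify window is over.
            finish = max(finish, boundary + verify_delay)
        boundary = finish
    # The last region's Verify window cannot be hidden behind a later region.
    return boundary + verify_delay


if __name__ == "__main__":
    regions = [100, 120, 80]
    print(serialized_time(regions, 30))  # 390
    print(overlapped_time(regions, 30))  # 330
```

In this model, the Verify delay is fully hidden whenever each region's Execute time is at least as long as the Verify window; only the final region's Verify window remains exposed.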
Pages: 176-187
Page count: 12