Soft Error Resilience at Near-Zero Cost

Cited by: 1
Authors
Zeng, Jianping [1]
Huang, Shao-Yu [1]
Liu, Jiuyang [2]
Jung, Changhee [1]
Affiliations
[1] Purdue Univ, W Lafayette, IN 47907 USA
[2] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
Source
PROCEEDINGS OF THE 38TH ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, ACM ICS 2024 | 2024
Keywords
soft error resilience; compiler; computer architecture; OPTIMIZATIONS;
DOI
10.1145/3650200.3656605
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Among existing schemes for soft error resilience, acoustic-sensor-based detection stands out owing to its ability to prevent silent data corruption at low hardware cost. However, the state-of-the-art work not only incurs a considerable run-time overhead but also complicates the processor pipeline with intrusive micro-architectural modifications, hindering its practical deployment in real silicon. To this end, this paper presents VeriPipe, a near-zero-cost compiler/architecture codesign scheme for soft error resilience. The VeriPipe compiler statically partitions the input program into a series of regions (epochs), while the VeriPipe hardware dynamically verifies that they are error-free. In particular, VeriPipe achieves simple yet efficient region-level verification in which each region goes through a three-stage (Execute, Verify, and Commit) verification pipeline to ensure the absence of soft errors before proceeding to the next region. Moreover, the VeriPipe hardware overlaps the Verify stage of each region with the Execute stage of the next region, thereby effectively hiding the Verify delay. Experiments with 43 applications from SPEC2006/2017, NPB-CPP, SPLASH3, and DoE Mini-Apps highlight the negligible overheads of VeriPipe, i.e., an average of 1% run-time overhead and a storage overhead of only 3 registers and 1 countdown timer.
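The benefit of overlapping Verify with the next region's Execute can be illustrated with a toy timing model. The sketch below is not taken from the paper: the function names, the cycle counts, and the assumption that only one region occupies the Verify stage at a time are all hypothetical and chosen only to make the idea concrete.

```python
def serial_time(exec_cycles, verify_delay):
    """Total cycles if every region waits for its own Verify before the next Execute."""
    return sum(e + verify_delay for e in exec_cycles)


def overlapped_time(exec_cycles, verify_delay):
    """Total cycles if Verify of region i overlaps Execute of region i+1.

    Assumption (hypothetical): a single Verify slot, so a region cannot
    leave Execute until the previous region has finished verifying.
    """
    exec_end = 0    # cycle at which the current region finishes Execute
    verify_end = 0  # cycle at which the current region finishes Verify
    for e in exec_cycles:
        # The next region starts executing right after the previous Execute,
        # but stalls at its end if the previous Verify is still in flight.
        exec_end = max(exec_end + e, verify_end)
        verify_end = exec_end + verify_delay
    return verify_end


# Hypothetical workload: three 100-cycle regions, 30-cycle verification delay.
regions = [100, 100, 100]
print(serial_time(regions, 30))      # 390
print(overlapped_time(regions, 30))  # 330
```

In this model the Verify delay is paid once, for the last region, as long as each region's Execute time is at least as long as the Verify delay; shorter regions would stall at the region boundary.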
Pages: 176-187
Page count: 12