Silent Data Corruptions: Microarchitectural Perspectives

Cited by: 15
Authors
Papadimitriou, George [1]
Gizopoulos, Dimitris [1]
Affiliations
[1] National and Kapodistrian University of Athens, Department of Informatics and Telecommunications, Athens 15784, Greece
Keywords
Hardware; Circuit faults; Microarchitecture; Software; Redundancy; Error correction codes; Computer bugs; Silent data corruptions; faults; errors; microarchitecture; microprocessor; fault injection; ARCHITECTURAL VULNERABILITY FACTOR; ERROR; PROPAGATION
DOI
10.1109/TC.2023.3285094
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Today more than ever before, academia, manufacturers, and hyperscalers acknowledge the major challenge of silent data corruptions (SDCs) and aim for solutions that minimize their impact by avoiding, detecting, and mitigating SDCs. Recent studies on large-scale datacenters conducted by Meta and Google report an unexpected rate of silent data corruption incidents that are attributed to modern microprocessor generations. Despite the acknowledged severity of the phenomenon, particularly at the datacenter scale, there is no in-depth analysis of the microarchitectural locations in a complex microprocessor that are more likely to generate an SDC at the program outputs. In this paper, we present a detailed analysis of the faulty behavior of many critical microarchitectural structures of a modern out-of-order microprocessor that generate silent data corruptions. Our analysis unveils several observations, including: (i) the magnitude of silent data corruptions attributed to different hardware structures, (ii) the instruction-related parameters that are more likely to result in a silent data corruption, (iii) the extent to which the operating system affects silent data corruption occurrences, and (iv) the byte positions of a word that are more likely to result in silent data corruptions. Collectively, such findings can inform decisions on hardware and software schemes that reduce the likelihood of silent data corruptions.
Pages: 3072-3085
Page count: 14