Physics-Informed Machine Learning for DRAM Error Modeling

Authors
Baseman, Elisabeth [1]
DeBardeleben, Nathan [1]
Blanchard, Sean [1]
Moore, Juston [2]
Tkachenko, Olena [3]
Ferreira, Kurt [4]
Siddiqua, Taniya [5]
Sridharan, Vilas [5]
Affiliations
[1] Los Alamos Natl Lab, Ultrascale Syst Res Ctr, Los Alamos, NM 87545 USA
[2] Los Alamos Natl Lab, Adv Res Cyber Syst, Los Alamos, NM USA
[3] New Mexico Consortium, Los Alamos, NM USA
[4] Sandia Natl Labs, Ctr Comp Res, Livermore, CA 94550 USA
[5] Adv Micro Devices Inc, RAS Architecture, Sunnyvale, CA 94088 USA
Source
2018 IEEE INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE IN VLSI AND NANOTECHNOLOGY SYSTEMS (DFT) | 2018
Keywords
IMPACT;
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
As high-performance computing facilities approach the exascale era, gaining a detailed understanding of hardware failures becomes important. In particular, the extreme memory capacity of modern supercomputers means that data corruption errors which were statistically negligible at smaller scales will become more prevalent. To understand hardware faults and mitigate their adverse effects on exascale workloads, we must learn from the behavior of current hardware. In this work, we investigate the predictability of DRAM errors using field data from two recently decommissioned supercomputers: Cielo, at Los Alamos National Laboratory, and Hopper, at Lawrence Berkeley National Laboratory. Due to the volume and complexity of the field data, we apply statistical machine learning to predict the probability of DRAM errors at previously unaccessed locations. We compare the predictive performance of six machine learning algorithms, and find that a model incorporating physical knowledge of DRAM spatial structure outperforms purely statistical methods. Our findings both support the expected physical behavior of DRAM hardware and provide a mechanism for real-time error prediction. We demonstrate real-world feasibility by training an error model on one supercomputer and effectively predicting errors on another. Our methods demonstrate the importance of spatial locality over temporal locality in DRAM errors, and show that relatively simple statistical models are effective at predicting future errors based on historical data, allowing proactive error mitigation.
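The abstract does not name the six algorithms or the features they use, so the following is only a minimal sketch of the general idea of spatial locality, not the authors' model. It assumes a hypothetical address layout (`node`, `bank`, `row`, `col`) and arbitrary illustrative weights, and scores a location by how many past errors share its bank, row, or column.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Loc(NamedTuple):
    """Hypothetical DRAM location; the field layout is an assumption."""
    node: int
    bank: int
    row: int
    col: int


def spatial_counts(history: Iterable[Loc]):
    """Count past errors per bank, per row, and per column."""
    by_bank, by_row, by_col = Counter(), Counter(), Counter()
    for e in history:
        by_bank[(e.node, e.bank)] += 1
        by_row[(e.node, e.bank, e.row)] += 1
        by_col[(e.node, e.bank, e.col)] += 1
    return by_bank, by_row, by_col


def spatial_score(loc: Loc, counts, weights=(1.0, 4.0, 4.0)) -> float:
    """Heuristic risk score in [0, 1), not a calibrated probability.

    The weights are illustrative assumptions: errors in the same row or
    column count more than errors that merely share the bank.
    """
    by_bank, by_row, by_col = counts
    raw = (weights[0] * by_bank[(loc.node, loc.bank)]
           + weights[1] * by_row[(loc.node, loc.bank, loc.row)]
           + weights[2] * by_col[(loc.node, loc.bank, loc.col)])
    return raw / (1.0 + raw)


# Usage: two past errors in the same row of node 0, bank 1.
history = [Loc(0, 1, 10, 5), Loc(0, 1, 10, 7)]
counts = spatial_counts(history)
print(spatial_score(Loc(0, 1, 10, 9), counts))  # same row as past errors
print(spatial_score(Loc(0, 2, 3, 3), counts))   # no shared structure
```

A location that has never reported an error still receives a nonzero score when it shares a row, column, or bank with past errors, which is the sense in which spatial structure can inform predictions at previously unaccessed locations. A real model would fit such weights to field data rather than fix them by hand.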
Pages: 6