Learning-Oriented Reliability Improvement of Computing Systems From Transistor to Application Level

Times cited: 0
Authors
Ranjbar, Behnaz [1]
Klemme, Florian [2]
Genssler, Paul R. [2]
Amrouch, Hussam [2]
Jung, Jinhyo [3]
Dave, Shail [4]
So, Hwisoo [3]
Lee, Kyongwoo [3]
Shrivastava, Aviral [4]
Lin, Ji-Yung [5, 6]
Weckx, Pieter [5]
Mishra, Subrat [5]
Catthoor, Francky [5, 6]
Biswas, Dwaipayan [5]
Kumar, Akash [1]
Affiliations
[1] Tech Univ Dresden, CFAED, Chair Processor Design, Dresden, Germany
[2] Univ Stuttgart, Chair Semicond Test & Reliabil STAR, Stuttgart, Germany
[3] Yonsei Univ, Seoul, South Korea
[4] Arizona State Univ, Sch Comp & Augmented Intelligence, Tempe, AZ USA
[5] IMEC, Leuven, Belgium
[6] Katholieke Univ Leuven, Leuven, Belgium
Source
2023 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION, DATE | 2023
Funding
National Research Foundation, Singapore; National Science Foundation (USA)
Keywords
Aging; Cross-layer reliability; Device and circuit reliability; Dynamic reliability estimation; Error mitigation; Machine learning for systems; Task scheduling; Timing reliability; ENERGY; DROP;
DOI
10.23919/DATE56975.2023.10137182
Chinese Library Classification number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Technology scaling in modern computing platforms has sharply increased safety and reliability problems: it often accelerates aging, leads to permanent faults, and causes unreliable execution of applications. Failures in some computing systems, such as avionics, may have catastrophic consequences. Managing reliability under all stress conditions and environmental changes is therefore crucial at every abstraction layer, from the application level down to the transistor level. Machine learning techniques have recently been employed for dynamic reliability estimation and optimization. They can adapt to varying workloads and system conditions. This paper presents reliability improvement approaches from multiple perspectives, from the transistor level to the application level, and discusses their effectiveness and limitations as well as open challenges.
Pages: 10