Reliability-oriented resource management for High-Performance Computing

Cited by: 3
Authors
Massari, Giuseppe [1]
Peta, Miriam [1]
Campi, Alessandro [1]
Reghenzani, Federico [1]
Terraneo, Federico [1]
Agosta, Giovanni [1]
Fornaciari, William [1]
Ciesielski, Sebastian [2]
Kulczewski, Michal [2]
Piatek, Wojciech [2]
Affiliations
[1] DEIB, Politecnico di Milano, Via G Ponzio 34-5, Milan, Italy
[2] Poznan Supercomputing and Networking Center, St Jana Pawla II 10, Poznan, Poland
Funding
EU Horizon 2020
Keywords
Reliability; HPC; Distributed systems; Resource management; Software simulators; Thermal management;
DOI
10.1016/j.suscom.2023.100873
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Reliability is an increasingly pressing issue for High-Performance Computing systems, as failures threaten large-scale applications, for which even a single run may incur significant energy and billing costs. Currently, application developers need to address reliability explicitly by integrating application-specific checkpoint/restore mechanisms. However, an application alone cannot exploit system-level knowledge, whereas a system-wide resource manager can. In this paper, we propose a reliability-oriented policy that can significantly increase component reliability by combining the use of checkpoint/restore mechanisms with proactive resource management policies.
Pages: 11