The RECIPE approach to challenges in deeply heterogeneous high performance systems

Cited by: 12
Authors
Agosta, Giovanni [1]
Fornaciari, William [1]
Atienza, David [2]
Canal, Ramon [3, 4]
Cilardo, Alessandro [5]
Flich Cardo, Jose [6]
Hernandez Luz, Carles [6]
Kulczewski, Michal [7]
Massari, Giuseppe [1]
Tornero Gavila, Rafael [6]
Zapater, Marina [2]
Affiliations
[1] Politecn Milan, Milan, Italy
[2] Ecole Polytech Fed Lausanne, Lausanne, Switzerland
[3] Barcelona Supercomp Ctr, Barcelona, Spain
[4] Univ Politecn Cataluna, Barcelona, Spain
[5] Univ Naples Federico II, CeRICT, Naples, Italy
[6] Univ Politecn Valencia, Valencia, Spain
[7] Poznan Supercomp & Networking Ctr, Poznan, Poland
Keywords
HPC; Heterogeneous computing; Run-time management;
DOI
10.1016/j.micpro.2020.103185
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Subject classification code
0812;
Abstract
RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims to introduce a hierarchical runtime resource management infrastructure that optimizes energy efficiency and minimizes the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches that maximize hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. We address it through hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of the application latency and hardware reliability obtained, respectively, through timing analysis and through modeling of the thermal properties and mean-time-to-failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case. © 2020 Elsevier B.V. All rights reserved.
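The abstract's link between mean-time-to-failure (MTTF) prediction accuracy and checkpointing overhead can be illustrated with a minimal sketch. This is not the RECIPE model: it assumes the textbook first-order waste model, in which a checkpoint interval T with checkpoint cost C wastes roughly C/T + T/(2·MTTF) of the run time, and Young's approximation T = sqrt(2·C·MTTF) for the interval. All function names and the numeric values are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Checkpoint interval from Young's first-order approximation (assumed model)."""
    if checkpoint_cost_s <= 0 or mttf_s <= 0:
        raise ValueError("checkpoint cost and MTTF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, mttf_s: float) -> float:
    """First-order wasted fraction: checkpoint writes plus expected rework after a failure."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mttf_s)


def overhead_with_predicted_mttf(checkpoint_cost_s: float,
                                 true_mttf_s: float,
                                 predicted_mttf_s: float) -> float:
    """Overhead actually paid when the interval is tuned to a (possibly wrong) MTTF estimate."""
    interval = young_interval(checkpoint_cost_s, predicted_mttf_s)
    return overhead_fraction(interval, checkpoint_cost_s, true_mttf_s)


if __name__ == "__main__":
    cost = 60.0            # hypothetical: 60 s to write one checkpoint
    true_mttf = 86_400.0   # hypothetical: one failure per day on average
    for factor in (0.25, 1.0, 4.0):  # predicted MTTF = factor * true MTTF
        paid = overhead_with_predicted_mttf(cost, true_mttf, factor * true_mttf)
        print(f"prediction factor {factor:>4}: overhead {paid:.4f}")
```

Under this simplified model, an MTTF estimate that is off by a factor of 4 in either direction raises the overhead by 25% relative to a perfect prediction, which gives a rough sense of why prediction accuracy matters for the checkpointing policy.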
Pages: 13