Near-Optimal Thermal Monitoring Framework for Many-Core Systems-on-Chip

Cited by: 21
Authors
Ranieri, Juri [1]
Vincenzi, Alessandro [2]
Chebira, Amina [3]
Atienza, David [2]
Vetterli, Martin [1]
Affiliations
[1] Ecole Polytech Fed Lausanne, Sch Comp & Commun Sci, CH-1015 Lausanne, Switzerland
[2] Ecole Polytech Fed Lausanne, Sch Engn, CH-1015 Lausanne, Switzerland
[3] CSEM, CH-2002 Neuchâtel, Switzerland
Funding
European Research Council;
Keywords
Sensor placement; thermal management; thermal monitoring; TEMPERATURE SENSOR ALLOCATION; MULTIPROCESSOR SOCS; MANAGEMENT; MICROPROCESSORS; ARCHITECTURES; PLACEMENT; DESIGN;
DOI
10.1109/TC.2015.2395423
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Chip designers place on-chip thermal sensors to measure local temperatures, thus preventing thermal runaway in many-core processing architectures. However, the quality of the thermal reconstruction depends directly on the number of placed sensors, which should be minimized while guaranteeing full detection of all worst-case temperature gradients. In this paper, we present a complete framework for the thermal management of complex many-core architectures, such that we can precisely recover the thermal distribution from a minimal number of sensors. The proposed sensor placement algorithm is guaranteed to reduce the impact of noisy measurements on the reconstructed thermal distribution. We achieve significant improvements compared to the state of the art, in terms of both computational complexity and reconstruction precision. For example, for a 64-core system-on-chip with 64 noisy sensors (σ² = 4), we achieve an average reconstruction error of 1.5 °C, which is less than half of what previous state-of-the-art methods achieve. We also study the practical limits of the proposed method and show that we do not need realistic workloads to learn the model and efficiently place the sensors. In fact, we show that the reconstruction error does not increase significantly if we randomly generate the power traces of the components or if we have only part of the correct workload.
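The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the thermal map lies in a known low-dimensional linear subspace (the toy `basis` below is hypothetical), picks the sensor subset that minimizes the least-squares noise amplification by exhaustive search, and reconstructs the full map from those readings. All function names are illustrative.

```python
from itertools import combinations


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def noise_gain(basis, sensors):
    """trace((S^T S)^-1), where S holds the basis rows at the sensor cells.

    For i.i.d. sensor noise of variance sigma^2, the coefficient error of the
    least-squares estimate is sigma^2 times this value.
    """
    rows = [basis[i] for i in sensors]
    try:
        inv = inverse(matmul(transpose(rows), rows))
    except ValueError:
        return float("inf")  # this subset cannot identify the coefficients
    return sum(inv[i][i] for i in range(len(inv)))


def place_sensors(basis, count):
    """Exhaustive search (toy sizes only) for the least noise-sensitive subset."""
    return min(combinations(range(len(basis)), count),
               key=lambda s: noise_gain(basis, s))


def reconstruct(basis, sensors, readings):
    """Least-squares estimate of the full map from the sensor readings."""
    rows = [basis[i] for i in sensors]
    rt = transpose(rows)
    rhs = matmul(rt, [[y] for y in readings])
    alpha = matmul(inverse(matmul(rt, rows)), rhs)
    return [row[0] for row in matmul(basis, alpha)]


# Hypothetical 5-cell chip whose maps are a mean level plus a linear gradient.
basis = [[1, -2], [1, -1], [1, 0], [1, 1], [1, 2]]
true_map = [50.0, 55.0, 60.0, 65.0, 70.0]

sensors = place_sensors(basis, 2)
estimate = reconstruct(basis, sensors, [true_map[i] for i in sensors])
print(sensors, estimate)
```

Exhaustive search is used here only because the example is tiny; it does not scale to realistic chip sizes, where a more efficient selection strategy would be needed.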
Pages: 3197-3209
Number of pages: 13