Automating CPU Dynamic Thermal Control for High Performance Computing

Cited by: 7
Authors
Ali, Ghazanfar [1]
Wofford, Lowell [2]
Turner, Christopher [1]
Chen, Yong [1]
Affiliations
[1] Texas Tech Univ, Lubbock, TX 79409 USA
[2] Los Alamos Natl Lab, Ultrascale Syst Res Ctr, Los Alamos, NM USA
Source
2022 22ND IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2022) | 2022
Funding
U.S. National Science Foundation
Keywords
CPU Temperature; Automation; HPC; Data Center; Kraken; Dynamic Voltage and Frequency Scaling; Powersave; Performance; Dynamic Thermal Control; Redfish; DVFS; Computing Cluster Dynamic Thermal Control; Data Center Automation; High Performance Computing; DATA CENTERS; TEMPERATURE; IMPACT;
DOI
10.1109/CCGrid54584.2022.00061
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In a production high-performance computing (HPC) data center, numerous factors, including workload compute intensity, cooling infrastructure failure, and the use of economized cooling, can substantially increase the CPU temperature. CPU thermal design studies have shown that slight variations in operating temperature can significantly affect the lifetime, durability, and performance of a CPU. It is therefore critical to monitor and control the CPU operating temperature. In this study, we design an automated, continuous CPU thermal monitoring and control methodology to maintain a healthy CPU thermal state. This research uses the Redfish protocol to monitor the CPU temperature and dynamic voltage and frequency scaling (DVFS) to control it. We developed a reference implementation and evaluated our methodology on a cluster of 150 Raspberry Pi 3 nodes. We performed extensive CPU thermal analyses in different scenarios. We analyzed how quickly a CPU reaches its maximum temperature under 100% load at room temperature. In our experiments, the temperature of a CPU under 100% load rose to approximately 72 °C (161.6 °F) with the lowest CPU frequency configuration and approximately 86 °C (186.8 °F) with the highest. We analyzed the impact of applying thermal control at eight temperature configurations on the thermal and frequency scaling behavior of a CPU. We observed that applying thermal control at lower temperature configurations (e.g., 70 °C (158 °F)) is better for cooling down an overheated CPU. With the proposed model, a CPU operating at normal temperature will consume comparatively less energy, deliver higher performance, and gain durability.
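The monitor-and-control loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' reference implementation. It assumes the temperature arrives as a Redfish Thermal-style JSON payload with a "Temperatures" array whose entries carry "ReadingCelsius" and "PhysicalContext", and that control means switching between the "powersave" and "performance" governors named in the keywords. The function names, the sample payload, and the 5 °C hysteresis band are hypothetical choices made for illustration; only the 70 °C threshold comes from the abstract.

```python
import json


def max_cpu_temp_c(thermal_json: str) -> float:
    """Return the hottest CPU reading from a Redfish Thermal-style payload.

    Assumption: sensors are listed under "Temperatures" and CPU sensors are
    tagged with PhysicalContext == "CPU".
    """
    payload = json.loads(thermal_json)
    readings = [
        sensor["ReadingCelsius"]
        for sensor in payload.get("Temperatures", [])
        if sensor.get("PhysicalContext") == "CPU"
        and sensor.get("ReadingCelsius") is not None
    ]
    if not readings:
        raise ValueError("no CPU temperature readings in payload")
    return float(max(readings))


def choose_governor(temp_c: float, current: str,
                    threshold_c: float = 70.0,
                    hysteresis_c: float = 5.0) -> str:
    """Pick a DVFS governor from the current CPU temperature.

    At or above the threshold, scale down with "powersave". Return to
    "performance" only once the CPU has cooled below the threshold by the
    hysteresis band (an illustrative addition that avoids rapid toggling).
    In between, keep the current governor.
    """
    if temp_c >= threshold_c:
        return "powersave"
    if temp_c <= threshold_c - hysteresis_c:
        return "performance"
    return current


# Hypothetical sample payload; real Redfish services return more fields.
SAMPLE = json.dumps({
    "Temperatures": [
        {"Name": "CPU Temp", "PhysicalContext": "CPU", "ReadingCelsius": 72},
        {"Name": "Inlet Temp", "PhysicalContext": "Intake", "ReadingCelsius": 41},
    ]
})

if __name__ == "__main__":
    temp = max_cpu_temp_c(SAMPLE)
    print(temp, choose_governor(temp, current="performance"))
```

On a Linux node, the chosen governor would typically be applied by writing it to the cpufreq sysfs interface (for example /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor), which requires root privileges and is left out of the sketch.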
Pages: 514-523
Page count: 10