Datacenter Scale Evaluation of the Impact of Temperature on Hard Disk Drive Failures

Cited by: 39
Authors
Sankar, Sriram [1]
Shaw, Mark [1]
Vaid, Kushagra [1]
Gurumurthi, Sudhanva [2]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
[2] Univ Virginia, Dept Comp Sci, Charlottesville, VA 22903 USA
Keywords
Design; Experimentation; Reliability; Datacenter; hard disk drives; temperature impact;
DOI
10.1145/2491472.2491475
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
With the advent of cloud computing and online services, large enterprises rely heavily on their datacenters to serve end users. A large datacenter facility incurs increased maintenance costs in addition to service unavailability when there are increased failures. Among different server components, hard disk drives are known to contribute significantly to server failures; however, there is very little understanding of the major determinants of disk failures in datacenters. In this work, we focus on the interrelationship between temperature, workload, and hard disk drive failures in a large scale datacenter. We present a dense storage case study from a population housing thousands of servers and tens of thousands of disk drives, hosting a large-scale online service at Microsoft. We specifically establish correlation between temperatures and failures observed at different location granularities: (a) inside drive locations in a server chassis, (b) across server locations in a rack, and (c) across multiple racks in a datacenter. We show that temperature exhibits a stronger correlation to failures than the correlation of disk utilization with drive failures. We establish that variations in temperature are not significant in datacenters and have little impact on failures. We also explore workload impacts on temperature and disk failures and show that the impact of workload is not significant. We then experimentally evaluate knobs that control disk drive temperature, including workload and chassis design knobs. We corroborate our findings from the real data study and show that workload knobs show minimal impact on temperature. Chassis knobs like disk placement and fan speeds have a larger impact on temperature. Finally, we also show the proposed cost benefit of temperature optimizations that increase hard disk drive reliability.
Pages: 24