81 entries in total
[21]
Learning from failure across multiple clusters: A trace-driven approach to understanding, predicting, and mitigating job terminations [J]. 2017 IEEE 37TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2017), 2017: 1333-1344
[22]
El-Sayed N, 2013, I C DEPEND SYS NETWO
[23]
Fu XY, 2014, IEEE INT C CL COMP, P103, DOI 10.1109/CLUSTER.2014.6968768
[24]
LogMaster: Mining Event Correlations in Logs of Large-scale Cluster Systems [J]. 2012 31ST INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS (SRDS 2012), 2012: 71-80
[25]
Gainaru A., 2012, SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, P1, DOI 10.1109/SC.2012.57
[26]
Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems [J]. 2012 IEEE 26TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2012: 1168-1179
[27]
Gainaru A, 2011, LECT NOTES COMPUT SC, V6852, P52, DOI 10.1007/978-3-642-23400-2_6
[28]
A Practical Approach to Hard Disk Failure Prediction in Cloud Platforms: Big Data Model for Failure Management in Datacenters [J]. PROCEEDINGS 2016 IEEE SECOND INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING SERVICE AND APPLICATIONS (BIGDATASERVICE 2016), 2016: 105-116
[30]
GHAHRAMANI Z, 2001, HIDDEN MARKOV MODELS, V15, P9