15 entries in total
[1] The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications [J]. SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2014: 154-165
[2] Amvrosiadis G, 2018, PROCEEDINGS OF THE 2018 USENIX ANNUAL TECHNICAL CONFERENCE, P533
[3] Bass N., DEAC5207NA27344 US D
[4] Evolution of Monitoring Over the Lifetime of a High Performance Computing Cluster [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING - CLUSTER 2015, 2015: 710-713
[5] Lessons Learned From the Analysis of System Failures at Petascale: The Case of Blue Waters [J]. 2014 44TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN), 2014: 610-621
[6] LOGAIDER: A tool for mining potential correlations of HPC log events [J]. 2017 17TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID), 2017: 442-451
[7] Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems [J]. 2012 IEEE 26TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2012: 1168-1179
[8] Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications [J]. SC'17: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2017
[9] Jha S., 2015, P 5 WORKSH FAULT TOL, P11
[10] What supercomputers say: A study of five system logs [J]. 37TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS, PROCEEDINGS, 2007: 575-+