Modeling Expected Application Runtime for Characterizing and Assessing Job Performance Workshop paper: HPCMASPA 2018

Cited by: 4
Authors
Aaziz, Omar [1]
Cook, Jonathan [2]
Tanash, Mohammed [2]
Affiliations
[1] Sandia Natl Labs, POB 5800, Albuquerque, NM 87185 USA
[2] New Mexico State Univ, Las Cruces, NM 88003 USA
Source
2018 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER) | 2018
DOI
10.1109/CLUSTER.2018.00070
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
In this paper, we present a methodology for modeling the expected runtime of a job based on historical application data and data from the job itself. This estimation model is useful to both HPC users and administrators as a baseline against which to compare the actual job runtime, thus establishing a measure of the job's performance. We used job data, system data, and hardware performance counters in a near-zero overhead manner to model and assess job performance, in particular whether or not the job runtime was in line with expectations from historical application performance. We show across three proxy applications and three real applications that our estimations are within 5% of actual performance.
Pages: 543-551
Page count: 9