Experience with using the Parallel Workloads Archive

Cited by: 184
Authors
Feitelson, Dror G. [1]
Tsafrir, Dan [2]
Krakov, David [1]
Affiliations
[1] Hebrew Univ Jerusalem, Dept Comp Sci, IL-91904 Jerusalem, Israel
[2] Technion Israel Inst Technol, Dept Comp Sci, IL-32000 Haifa, Israel
Funding
Israel Science Foundation
Keywords
Workload log; Data quality; Parallel job scheduling; Performance; Packing
DOI
10.1016/j.jpdc.2014.06.013
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Science is based upon observation. The scientific study of complex computer systems should therefore be based on observation of how they are used in practice, as opposed to how they are assumed to be used or how they were designed to be used. In particular, detailed workload logs from real computer systems are invaluable for research on performance evaluation and for designing new systems. Regrettably, workload data may suffer from quality issues that might distort the study results, just as scientific observations in other fields may suffer from measurement errors. The cumulative experience with the Parallel Workloads Archive, a repository of job-level usage data from large-scale parallel supercomputers, clusters, and grids, has exposed many such issues. Importantly, these issues were not anticipated when the data was collected, and uncovering them was not trivial. As the data in this archive is used in hundreds of studies, it is necessary to describe and debate procedures that may be used to improve its data quality. Specifically, we consider issues like missing data, inconsistent data, erroneous data, system configuration changes during the logging period, and unrepresentative user behavior. Some of these may be countered by filtering out the problematic data items. In other cases, being cognizant of the problems may affect the decision of which datasets to use. While grounded in the specific domain of parallel jobs, our findings and suggested procedures can also inform similar situations in other domains. (C) 2014 Elsevier Inc. All rights reserved.
Pages: 2967-2982
Page count: 16