Modeling Application Resilience in Large-scale Parallel Execution

Cited by: 0
Authors
Wu, Kai [1]
Dong, Wenqian [1]
Guan, Qiang [2]
DeBardeleben, Nathan [3]
Li, Dong [1]
Affiliations
[1] Univ Calif Merced, Merced, CA 95343 USA
[2] Kent State Univ, Kent, OH 44242 USA
[3] Los Alamos Natl Lab, Washington, DC USA
Funding
US National Science Foundation;
DOI
10.1145/3225058.3225119
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Understanding how resilient an application is to hardware and software errors is critical to high-performance computing. Application-level fault injection is the most common method for evaluating application resilience. However, it can be very expensive when the application runs in parallel at large scale, because fault injection then demands substantial hardware resources. In this paper, we introduce a new methodology for evaluating the resilience of applications running at large scale. Instead of injecting errors into the application in large-scale execution, we inject errors into the application in small-scale and serial execution, and use the results to model and predict the fault injection result for the application running at large scale. Our models are based on a series of empirical observations that characterize error occurrence and propagation across MPI processes in small-scale execution (including serial execution) and in large-scale execution. Our models achieve high prediction accuracy. Evaluating with four NAS parallel benchmarks and two proxy scientific applications, we demonstrate that, when the fault injection result is used to predict for 64 MPI processes, the average prediction error is 8%. When the fault injection result is used to make the same prediction for eight MPI processes, the average prediction error decreases to 7%.
Pages: 10
Related papers
50 in total
  • [1] A Flexible Strategy for Distributed and Parallel Execution of a Monolithic Large-Scale Sequential Application
    Navarro, Felipe
    Gonzalez, Carlos
    Peredo, Oscar
    Morales, Gerson
    Egana, Alvaro
    Ortiz, Julian M.
    HIGH PERFORMANCE COMPUTING, CARLA 2014, 2014, 485 : 54 - 67
  • [2] Parallel genesis for large-scale modeling
    Goddard, NH
    Hood, G
    COMPUTATIONAL NEUROSCIENCE: TRENDS IN RESEARCH, 1997, 1997, : 911 - 917
  • [3] Application representations for multiparadigm performance modeling of large-scale parallel scientific codes
    Adve, V
    Sakellariou, R
    INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2000, 14 (04): : 304 - 316
  • [4] Research on Parallel Large-Scale Terrain Modeling for Visualization
    Xiao, Luhao
    Gong, Guanghong
    THEORY, METHODOLOGY, TOOLS AND APPLICATIONS FOR MODELING AND SIMULATION OF COMPLEX SYSTEMS, PT I, 2016, 643 : 387 - 397
  • [5] Research on the scalability of the large-scale parallel application programs
    Chen, Jun
    Mo, Zeyao
    Li, Xiaomei
    Yuan, Guoxing
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2000, 37 (11): : 1382 - 1388
  • [6] Selection and Execution of large-scale projects
    Ahrens, G. -A.
    Beckmann, K. J.
    Boltze, M.
    Eisenkopf, A.
    Fricke, H.
    Knieps, G.
    Knorr, A.
    Mitusch, K.
    Oeter, S.
    Radermacher, F. -J
    Sieg, G.
    Siegmann, J.
    Schlag, B.
    Stoelzle, W.
    Vallee, D.
    Winner, H.
    BAUINGENIEUR, 2015, 90 : 129 - 139
  • [7] A data parallel approach for large-scale Gaussian process modeling
    Choudhury, A
    Nair, PB
    Keane, AJ
    PROCEEDINGS OF THE SECOND SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2002, : 95 - 111
  • [8] MERPSYS: An environment for simulation of parallel application execution on large scale HPC systems
    Czarnul, Pawel
    Kuchta, Jaroslaw
    Matuszek, Mariusz
    Proficz, Jerzy
    Rosciszewski, Pawel
    Wojcik, Michal
    Szymanski, Julian
    SIMULATION MODELLING PRACTICE AND THEORY, 2017, 77 : 124 - 140
  • [9] Modeling research on manufacturing execution system based on large-scale system cybernetics
    Wu Y.
    Xu X.-D.
    Li C.-X.
    J. Shanghai Jiaotong Univ. Sci., 2008, 6 (744-747): : 744 - 747
  • [10] Large-scale parallel execution of urban-scale traffic simulation and its performance on K computer
    Daigo Umemoto
    Nobuyasu Ito
    Journal of Computational Social Science, 2019, 2 : 97 - 101