Fault tolerance in HPC scientific workflow application

Cited by: 0
Authors
Li Y. [1, 2]
Mo Z. [2]
Xiao Y. [1]
Zhao S. [1]
Duan B. [1]
Affiliations
[1] Institute of Computer Application, Chinese Academy of Engineering Physics, Mianyang
[2] Institute of Applied Physics and Computational Mathematics, Beijing
Source
Mo, Zeyao (zeyao_mo@iapcm.ac.cn) | 2020 / National University of Defense Technology / Vol. 42
Keywords
Decision tree model; Fault tolerance; Scientific workflow; Workflow engine
DOI
10.11887/j.cn.202006010
Abstract
Scientific workflow technologies are widely applied in HPC for scientific research and engineering simulation. Applications such as numerical simulation of complex multi-physics problems and multi-stage data processing need software that composes an automatically executable workflow to improve efficiency. Many exceptions, such as resource failures and task configuration errors, can halt workflow execution, so robust and continuous execution is important for workflow applications. A taxonomy of fault tolerance in workflows is presented, and fault tolerance techniques in typical workflow systems are reviewed. A decision-tree-based event-condition-action fault tolerance model is proposed, and a non-intrusive, extensible framework is designed and implemented in our HPC scientific workflow system, HSWAP. Runtime-configurable error recovery strategies are also implemented in the fault tolerance software module. To validate the model and framework, the fault tolerance functions were tested in a real engineering simulation project. The results show that fault tolerance plays an important role in increasing workflow execution efficiency. © 2020, NUDT Press. All rights reserved.
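The abstract does not give the details of the decision-tree-based event-condition-action model, so the following is only a minimal sketch of the general idea, not HSWAP's implementation. A fault event enters a small decision tree whose inner nodes test conditions and whose leaves name a recovery action; the actions sit in a dictionary so a strategy can be replaced at runtime. All names here (FaultEvent, Node, Leaf, ACTIONS, MAX_RETRIES, the fault kinds, and the three actions) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Union


# Hypothetical fault event; the field names are illustrative, not HSWAP's.
@dataclass
class FaultEvent:
    task: str
    kind: str          # e.g. "resource_failure" or "config_error"
    retries: int = 0   # how many times the task has already been retried


# A leaf names a recovery action.
@dataclass
class Leaf:
    action: str


# An inner node tests one condition on the event and picks a branch.
@dataclass
class Node:
    condition: Callable[[FaultEvent], bool]
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def decide(tree: Union[Node, Leaf], event: FaultEvent) -> str:
    """Walk from the root until a leaf is reached and return its action name."""
    while isinstance(tree, Node):
        tree = tree.if_true if tree.condition(event) else tree.if_false
    return tree.action


# Recovery strategies live in a plain dict so they can be swapped at runtime.
ACTIONS = {
    "retry": lambda e: f"retry {e.task} (attempt {e.retries + 1})",
    "migrate": lambda e: f"resubmit {e.task} on another resource",
    "abort": lambda e: f"stop workflow at {e.task} and notify the user",
}

MAX_RETRIES = 3  # assumed threshold, not taken from the paper

# Event: a fault is reported. Conditions: fault kind, then retry count.
TREE = Node(
    condition=lambda e: e.kind == "resource_failure",
    if_true=Node(
        condition=lambda e: e.retries < MAX_RETRIES,
        if_true=Leaf("retry"),
        if_false=Leaf("migrate"),
    ),
    if_false=Leaf("abort"),
)


def handle(event: FaultEvent) -> str:
    """Event -> condition (tree walk) -> action (dictionary lookup)."""
    return ACTIONS[decide(TREE, event)](event)
```

In this sketch, reassigning an entry such as ACTIONS["abort"] changes how later faults are handled without touching the tree, which is one simple way to read "runtime configurable" recovery strategies.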
Pages: 82-89
Page count: 7