Lessons Learned Implementing User-Level Failure Mitigation in MPICH

Cited by: 12
Authors
Bland, Wesley [1]
Lu, Huiwei [1]
Seo, Sangmin [1]
Balaji, Pavan [1]
Affiliation
[1] Argonne Natl Lab, Div Math & Comp Sci, 9700 S Cass Ave, Argonne, IL 60439 USA
Source
2015 15TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING | 2015
DOI
10.1109/CCGrid.2015.51
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
User-level failure mitigation (ULFM) is becoming the front-running solution for process fault tolerance in MPI. While not yet adopted into the MPI standard, it is being used by applications and libraries and is being considered by the MPI Forum for future inclusion into MPI itself. In this paper, we introduce an implementation of ULFM in MPICH, a high-performance and widely portable implementation of the MPI standard. We demonstrate that while still a reference implementation, the runtime cost of the new API calls introduced is relatively low.
Pages: 1123-1126
Page count: 4