CRState: checkpoint/restart of OpenCL program for in-kernel applications

Cited by: 0
Authors
Genlang Chen
Jiajian Zhang
Zufang Zhu
Qiangqiang Jiang
Hai Jiang
Chaoyi Pang
Affiliations
[1] Zhejiang University Ningbo Research Institute, Zhejiang University Ningbo Institute of Technology
[2] Zhejiang University, College of Computer Science, Polytechnic Institute, College of Software Technology
[3] Arkansas State University, Department of Computer Science
Source
The Journal of Supercomputing | 2021 / Vol. 77
Keywords
Checkpoint/restart; Heterogeneous; OpenCL
DOI
Not available
Abstract
The checkpoint/restart mechanism is critical in a preemptive system because clusters with this mechanism will be improved in terms of fault tolerance, load balance, and resource utilization. As graphics processing units (GPUs) have more recently become commonplace with the advent of general-purpose computation, and open computing language (OpenCL) programs are portable across various CPUs and GPUs, it is increasingly important to set up a checkpoint/restart mechanism in OpenCL programs. However, due to the complexity of the internal computational state of the GPU, there is currently no effective and reasonable checkpoint/restart scheme for OpenCL applications. This paper proposes a feasible system, checkpoint/restart state (CRState), to achieve checkpoint/restart in GPU kernels. The computation states, including heap, data segments, local memory, stack, and code segments in the underlying hardware, are identified and concretized in order to establish an association between the underlying-level state and the application-level representation. Then, a pre-compiler is developed to insert primitives into OpenCL programs at compile time so that major components of the computation state will be extracted at runtime. Since the computation state is duplicated at the application level, such OpenCL programs can be preempted and ported across heterogeneous devices. A comprehensive example and ten authoritative benchmark programs are selected to demonstrate the feasibility and effectiveness of the proposed system.
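The application-level idea in the abstract can be illustrated with a minimal sketch. This is not the CRState implementation: Python stands in for OpenCL C, the kernel is simulated on the host, and every name here (`WorkItemState`, `run_work_item`, `checkpoint`, `restart`, `budget`) is hypothetical. It assumes that the state of a work-item can be reduced to a loop counter and one private accumulator, and that a checkpoint point sits at the top of the loop, which is the kind of location a pre-compiler might choose.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class WorkItemState:
    """Application-level copy of one work-item's state (hypothetical layout)."""
    gid: int      # global work-item id
    i: int = 0    # loop counter, standing in for the position in the code
    acc: int = 0  # private accumulator, standing in for a stack variable


def run_work_item(row, state, budget=None):
    """Run one simulated work-item (sum of squares of `row`) from `state`.

    `budget` caps the iterations before a simulated preemption; None means
    run to completion. Returns True if the work-item finished.
    """
    done = 0
    while state.i < len(row):
        # Inserted checkpoint point: the state is consistent here, so the
        # work-item can stop and be resumed later from `state` alone.
        if budget is not None and done >= budget:
            return False
        state.acc += row[state.i] * row[state.i]
        state.i += 1
        done += 1
    return True


def checkpoint(states):
    """Serialize the state to a device-independent form (JSON here)."""
    return json.dumps([asdict(s) for s in states])


def restart(blob):
    """Rebuild the state, possibly on a different device."""
    return [WorkItemState(**d) for d in json.loads(blob)]


rows = [[1, 2, 3, 4], [5, 6, 7, 8]]

# Reference run without interruption.
ref = [WorkItemState(gid=g) for g in range(len(rows))]
for s in ref:
    run_work_item(rows[s.gid], s)

# Preempted run: two iterations each, checkpoint, restore, then finish.
states = [WorkItemState(gid=g) for g in range(len(rows))]
finished = [run_work_item(rows[s.gid], s, budget=2) for s in states]
blob = checkpoint(states)
resumed = restart(blob)
for s in resumed:
    run_work_item(rows[s.gid], s)
```

Because the saved state is plain data rather than a raw image of device memory, the resumed run does not depend on where the first half executed. The abstract makes the same point when it says the computation state is duplicated at the application level so that programs can be ported across heterogeneous devices.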
Pages: 5426-5467
Page count: 41