Improving resource utilization and fault tolerance in large simulations via actors

Cited by: 3
Authors
Klenk, Kyle [1]
Spiteri, Raymond J. [1]
Affiliations
[1] Univ Saskatchewan, Dept Comp Sci, Saskatoon, SK S7N 5C9, Canada
Source
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS | 2024 / Volume 27 / Issue 05
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Actor model of concurrent computation; Scalability; resource utilization; Fault tolerance; Scientific computing; High-performance computing; SUMMA model; PERFORMANCE; SYSTEMS; SCALA;
DOI
10.1007/s10586-024-04318-5
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Large simulations with many independent sub-simulations are common in scientific computing. There are numerous challenges, however, associated with performing such simulations in shared computing environments. For example, sub-simulations may have wildly varying completion times or not complete at all, leading to unpredictable runtimes as well as unbalanced and inefficient use of human and computational resources. In this study, we use the actor model of concurrent computation to improve both the resource utilization and fault tolerance for large-scale scientific computing simulations. More specifically, we use actors in the SUMMA model to manage a large-scale hydrological simulation over the North American continent with over 500,000 independent sub-simulations. We find that the actors implementation outperforms a standard array job submission as well as the job submission tool GNU Parallel by better balancing the computational load across processors. The actors implementation also improves fault tolerance and can eliminate the user intervention required to detect and re-submit failed jobs.
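The load-balancing and fault-tolerance idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only assumes that a coordinator actor hands out one sub-simulation at a time to whichever worker actor is idle and re-queues failures automatically. The names `run_sub_simulation`, `worker_actor`, `run_all`, the failure rule, and the `max_attempts` limit are all hypothetical, and Python threads with queue mailboxes stand in for a real actor framework.

```python
import queue
import threading


def run_sub_simulation(index, attempt):
    """Hypothetical stand-in for one independent sub-simulation.

    Every fifth index fails on its first attempt to mimic a flaky job.
    """
    if index % 5 == 0 and attempt == 1:
        raise RuntimeError(f"sub-simulation {index} failed")
    return index * index  # placeholder result


def worker_actor(mailbox, coordinator_mailbox):
    """Worker actor: processes one message at a time from its own mailbox."""
    while True:
        message = mailbox.get()
        if message is None:  # shutdown signal from the coordinator
            return
        index, attempt = message
        try:
            result = run_sub_simulation(index, attempt)
            coordinator_mailbox.put(("done", mailbox, index, attempt, result))
        except Exception:
            coordinator_mailbox.put(("failed", mailbox, index, attempt, None))


def run_all(num_tasks, num_workers=4, max_attempts=3):
    """Coordinator actor: dynamic dispatch plus automatic retry of failures."""
    coordinator_mailbox = queue.Queue()
    pending = [(i, 1) for i in range(num_tasks)]  # (index, attempt number)
    results, attempts, abandoned = {}, {}, []

    workers = []
    for _ in range(num_workers):
        mailbox = queue.Queue()
        thread = threading.Thread(
            target=worker_actor, args=(mailbox, coordinator_mailbox)
        )
        thread.start()
        workers.append((mailbox, thread))

    # Give each worker one task; further tasks go to whoever reports back first.
    in_flight = 0
    for mailbox, _ in workers:
        if pending:
            mailbox.put(pending.pop())
            in_flight += 1

    while in_flight:
        status, mailbox, index, attempt, result = coordinator_mailbox.get()
        in_flight -= 1
        if status == "done":
            results[index] = result
            attempts[index] = attempt
        elif attempt < max_attempts:
            pending.append((index, attempt + 1))  # re-submit without user action
        else:
            abandoned.append(index)
        if pending:  # the reporting worker is idle, so it gets the next task
            mailbox.put(pending.pop())
            in_flight += 1

    for mailbox, thread in workers:
        mailbox.put(None)
        thread.join()
    return results, attempts, abandoned
```

Because each worker receives new work only when it reports back, a slow sub-simulation delays only the worker running it. A static array-job split, by contrast, fixes the assignment in advance, so one long-running batch can leave other processors idle.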
Pages: 6323-6340
Page count: 18