Automatic Monitoring of Large-Scale Computing Infrastructure

Authors
Kim, Bockjoo [1]
Bourilkov, Dimitri [1]
Affiliations
[1] Univ Florida, Dept Phys, Gainesville, FL 32611 USA
Source
26TH INTERNATIONAL CONFERENCE ON COMPUTING IN HIGH ENERGY AND NUCLEAR PHYSICS, CHEP 2023 | 2024 / Vol. 295
DOI
10.1051/epjconf/202429507007
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Modern distributed computing systems produce large amounts of monitoring data. For these systems to operate smoothly, under-performing or failing components must be identified quickly, and preferably automatically, enabling the system managers to react accordingly. In this contribution, we analyze jobs and transfer data collected in the running of the LHC computing infrastructure. The monitoring data is harvested from the Elasticsearch database and converted to formats suitable for further processing. Based on various machine and deep learning techniques, we develop automatic tools for continuous monitoring of the health of the underlying systems. Our initial implementation is based on publicly available deep learning tools, PyTorch or TensorFlow packages, running on state-of-the-art GPU systems.
Pages: 7