Automatic Monitoring of Large-Scale Computing Infrastructure

Authors
Kim, Bockjoo [1]
Bourilkov, Dimitri [1]
Affiliations
[1] Univ Florida, Dept Phys, Gainesville, FL 32611 USA
Source
26th International Conference on Computing in High Energy and Nuclear Physics, CHEP 2023 | 2024, Vol. 295
DOI
10.1051/epjconf/202429507007
Chinese Library Classification number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
Modern distributed computing systems produce large amounts of monitoring data. For these systems to operate smoothly, under-performing or failing components must be identified quickly, and preferably automatically, enabling the system managers to react accordingly. In this contribution, we analyze jobs and transfer data collected in the running of the LHC computing infrastructure. The monitoring data is harvested from the Elasticsearch database and converted to formats suitable for further processing. Based on various machine and deep learning techniques, we develop automatic tools for continuous monitoring of the health of the underlying systems. Our initial implementation is based on publicly available deep learning tools, PyTorch or TensorFlow packages, running on state-of-the-art GPU systems.
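The harvesting and flagging steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the field names `site` and `status` are hypothetical, the input is assumed to be a dictionary shaped like an Elasticsearch search response (`hits.hits[]._source`), and a simple median/MAD outlier rule stands in for the machine and deep learning techniques the paper refers to.

```python
from statistics import median


def flatten_hits(response):
    """Convert an Elasticsearch-style search response into flat row dicts.

    Each row is the hit's `_source` document plus its `_id`.
    """
    return [dict(hit["_source"], _id=hit["_id"]) for hit in response["hits"]["hits"]]


def failure_rate_by_site(rows):
    """Return {site: fraction of rows whose status is not "ok"}.

    `site` and `status` are hypothetical field names used for illustration.
    """
    totals, failures = {}, {}
    for row in rows:
        site = row["site"]
        totals[site] = totals.get(site, 0) + 1
        if row["status"] != "ok":
            failures[site] = failures.get(site, 0) + 1
    return {site: failures.get(site, 0) / n for site, n in totals.items()}


def flag_outliers(rates, threshold=3.5):
    """Flag sites whose rate has a modified z-score above `threshold`.

    Uses the median and the median absolute deviation (MAD); this is a
    stand-in for a learned anomaly detector, not the paper's method.
    Returns an empty list when the MAD is zero (no usable spread).
    """
    values = list(rates.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []
    return sorted(
        site for site, v in rates.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )
```

In practice the response would come from a query against the monitoring index, and the per-site rates would feed whichever model is used in place of `flag_outliers`.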
Pages: 7