LIMITLESS - LIght-weight MonItoring Tool for LargE Scale Systems

Cited by: 4
Authors
Cascajo, Alberto [1]
Singh, David E. [1]
Carretero, Jesus [1]
Affiliations
[1] Univ Carlos III Madrid, Madrid, Spain
Source
2021 29TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING (PDP 2021) | 2021
Funding
European Union Horizon 2020;
Keywords
Monitoring; application modelling; performance prediction; PERFORMANCE;
DOI
10.1109/PDP52278.2021.00042
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This work presents LIMITLESS, an HPC framework that provides new strategies for monitoring clusters. LIMITLESS is a scalable, light-weight monitor that is integrated with other HPC runtimes in order to obtain a holistic view of the system that combines both platform and application monitoring. This paper describes the novel components of the architecture, including new approaches for achieving higher scalability based on a combination of in-transit processing and performance prediction. This work also includes a practical evaluation on simulated and real platforms that shows significant monitoring scalability, data-retrieval capacity, and reduced overheads.
Pages: 220-227
Page count: 8