LIKWID Monitoring Stack: A flexible framework enabling job specific performance monitoring for the masses

Cited by: 18

Authors:

Roehl, Thomas [1]

Eitzinger, Jan [1]

Hager, Georg [1]

Wellein, Gerhard [1]

Affiliations:

[1] Univ Erlangen Nurnberg, Erlangen Reg Comp Ctr RRZE, Erlangen, Germany

Source:

2017 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER) | 2017

Keywords:

DOI:

10.1109/CLUSTER.2017.115

Chinese Library Classification number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

System monitoring is an established tool for measuring the utilization and health of HPC systems. Usually, system monitoring infrastructures make no connection to job information and do not utilize hardware performance monitoring (HPM) data. Automatic and continuous performance monitoring of jobs is an essential component for increasing the efficient use of HPC systems. It helps to identify pathological cases, provides instant performance feedback to users, offers initial data for judging the optimization potential of applications, and helps to build a statistical foundation about application-specific system usage. The LIKWID monitoring stack is a modular framework built on top of the LIKWID tools library. It aims to enable job-specific performance monitoring using HPM data, system metrics, and application-level data for small to medium-sized commodity clusters. Moreover, it is designed to integrate into existing monitoring infrastructures to speed up the change from pure system monitoring to job-aware monitoring.


Pages: 781-784

Page count: 4
