MAP: A Visual Analytics System for Job Monitoring and Analysis

Cited by: 3

Authors:

Pal, Ashish [1]

Malakar, Preeti [1]

Affiliations:

[1] IIT Kanpur, Dept Comp Sci & Engn, Kanpur, Uttar Pradesh, India

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2020) | 2020

Keywords:

job log analysis; system monitoring; visual analytics; D3; visualization

DOI:

10.1109/CLUSTER49012.2020.00063

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

High-performance computing systems are shared by multiple users running compute-intensive jobs. Users submit jobs to batch queues, where the jobs wait for an unknown amount of time until the required resources become available. Resource managers collect a large amount of data about these jobs (submit time, start time, end time, resource requirements, etc.). Analyzing this data may help identify the causes of past problems and better optimize the system, but analyzing large, complex logs can be cumbersome. We have developed a unified job monitoring, analysis, and prediction system with which users can monitor the current state, analyze past job logs, and predict the wait times of future jobs. This paper focuses on the job monitoring and analysis modules.


Pages: 442-448

Page count: 7
