Detection and analysis of resource usage anomalies in large distributed systems through multi-scale visualization

Cited by: 4

Authors:

Schnorr, Lucas Mello [1]

Legrand, Arnaud [1]

Vincent, Jean-Marc [1]

Affiliations:

[1] Univ Grenoble, INRIA, CNRS, Grenoble, France

Source:

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2012 / Vol. 24 / Issue 15

Keywords:

performance visualization analysis; large-scale distributed systems; volunteer computing; grid computing; cloud computing; resource usage anomalies; PERFORMANCE; TOOL;

DOI:

10.1002/cpe.1885

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification code:

081202; 0835;

Abstract:

Understanding the behavior of large-scale distributed systems is generally extremely difficult, as it requires observing a very large number of components over very long periods of time. Most analysis tools for distributed systems gather basic information such as individual processor or network utilization. Although scalable because of the data reduction techniques applied before the analysis, these tools are often insufficient to detect or fully understand anomalies in the dynamic behavior of resource utilization and their influence on application performance. In this paper, we propose a methodology for detecting resource usage anomalies in large-scale distributed systems. The methodology relies on four functionalities: characterized trace collection, multi-scale data aggregation, specifically tailored user interaction techniques, and visualization techniques. We show the efficiency of this approach through the analysis of simulations of the volunteer computing Berkeley Open Infrastructure for Network Computing architecture. Three scenarios are analyzed in this paper: analysis of the resource sharing mechanism, resource usage considering response time instead of throughput, and the evaluation of the impact of input file size on the Berkeley Open Infrastructure for Network Computing architecture. The results show that our methodology makes it easy to identify resource usage anomalies, such as unfair resource sharing, contention, moving network bottlenecks, and harmful short-term resource sharing. Copyright (c) 2011 John Wiley & Sons, Ltd.
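As a rough illustration of what multi-scale data aggregation can mean in practice, the sketch below averages per-host utilization over a chosen time window and a chosen level of a resource hierarchy. It is not the authors' implementation: the function name `aggregate`, the sample format, and the host-to-path `hierarchy` mapping are all assumptions made for this example.

```python
from collections import defaultdict


def aggregate(samples, hierarchy, level, t_start, t_end):
    """Average utilization per group over the window [t_start, t_end).

    samples   -- hypothetical trace format: (host, start, end, utilization),
                 where utilization in [0, 1] is constant over [start, end)
    hierarchy -- hypothetical mapping host -> path tuple, e.g. ("grid", "siteA")
    level     -- number of leading path components used to form groups
    """
    window = t_end - t_start
    if window <= 0:
        raise ValueError("t_end must be greater than t_start")

    # Spatial scale: collect the hosts that belong to each group.
    members = defaultdict(set)
    for host, path in hierarchy.items():
        members[path[:level]].add(host)

    # Temporal scale: integrate utilization over the part of each
    # sample that overlaps the requested window.
    busy = defaultdict(float)
    for host, start, end, util in samples:
        overlap = min(end, t_end) - max(start, t_start)
        if overlap > 0:
            busy[hierarchy[host][:level]] += util * overlap

    # Normalize by window length and group size to get a mean in [0, 1].
    return {g: busy[g] / (window * len(hosts)) for g, hosts in members.items()}


if __name__ == "__main__":
    hierarchy = {
        "h1": ("grid", "siteA"),
        "h2": ("grid", "siteA"),
        "h3": ("grid", "siteB"),
    }
    samples = [("h1", 0, 10, 1.0), ("h2", 0, 5, 0.5), ("h3", 5, 15, 1.0)]
    print(aggregate(samples, hierarchy, level=2, t_start=0, t_end=10))
    print(aggregate(samples, hierarchy, level=1, t_start=0, t_end=10))
```

Changing `level` or the window bounds moves between fine and coarse views of the same trace, which is the kind of zooming that lets localized or short-lived anomalies stand out from aggregate behavior.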

Citation

Pages: 1792-1816

Page count: 25

References: 47 in total

