Timely Long Tail Identification Through Agent Based Monitoring and Analytics

Cited by: 14
Authors
Garraghan, Peter [1]
Ouyang, Xue [1]
Townend, Paul [1]
Xu, Jie [1]
Affiliation
[1] University of Leeds, School of Computing, Leeds, West Yorkshire, England
Source
2015 IEEE 18th International Symposium on Real-Time Distributed Computing (ISORC) | 2015
Keywords
Long Tail; Stragglers; Distributed Systems; Data analysis; agent based; datacenter; Cloud computing;
DOI
10.1109/ISORC.2015.39
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
The increasing complexity and scale of distributed systems have resulted in the manifestation of emergent behavior that substantially affects overall system performance. A significant emergent property is the "Long Tail", whereby a small proportion of task stragglers significantly impacts job completion times. To mitigate such behavior, straggling tasks occurring within the system need to be identified accurately and in a timely manner. However, current approaches focus on mitigation rather than identification, and typically identify stragglers too late in the execution lifecycle. This paper presents a method and tool to identify Long Tail behavior within distributed systems in a timely manner, through a combination of online and offline analytics. This is achieved through historical analysis to profile and model task execution patterns, which then inform online analytic agents that monitor task execution at runtime. Furthermore, we provide an empirical analysis of two large-scale production Cloud datacenters that demonstrates the challenge of data skew within modern distributed systems; this analysis shows that approximately 5% of task stragglers caused by data skew impact 50% of the total jobs for batch processes. Our results demonstrate that our approach is capable of identifying task stragglers less than 11% into their execution lifecycle with 98% accuracy, a significant improvement over current state-of-the-art practice that enables far more effective mitigation strategies in large-scale distributed systems worldwide.
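The two-stage idea in the abstract (an offline profile built from historical executions, consulted by an online monitor at runtime) can be illustrated with a minimal sketch. This is not the paper's method or tool: the function names, the use of a historical median as the profile, and the 1.5x threshold are all assumptions chosen for illustration.

```python
from statistics import median


def build_profile(historical_durations):
    """Offline step (assumed): summarize past task durations, in seconds."""
    return {"median": median(historical_durations)}


def is_likely_straggler(elapsed, progress, profile, threshold=1.5):
    """Online step (assumed): flag a running task whose projected duration
    exceeds `threshold` times the historical median.

    `progress` is the fraction of work completed, in (0, 1].
    """
    if progress <= 0:
        return False  # no progress reported yet, so no projection is possible
    projected = elapsed / progress  # linear extrapolation of total duration
    return projected > threshold * profile["median"]


# Hypothetical usage: a task 10% complete after 20 s, against a 100 s median.
profile = build_profile([100, 110, 90, 105, 95])
flagged = is_likely_straggler(elapsed=20, progress=0.1, profile=profile)
```

The linear extrapolation is the simplest possible projection; it is used here only to show how an early progress reading could be compared against a historical profile.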
Pages: 19-26
Page count: 8
Related papers
23 in total
  • [1] Ananthanarayanan G., 2010, P USENIX C OP SYST D, V10
  • [2] Ananthanarayanan G., 2013, 10 USENIX S NETW SYS, V13, P185
  • [3] Ananthanarayanan G., 2014, P 11 USENIX C NETWOR, P289
  • [4] [Anonymous], IEEE T COMPUTERS
  • [5] [Anonymous], 2008, 8 USENIX S OP SYST D
  • [6] [Anonymous], 2011, CISC VIS NETW IND GL
  • [7] Dean J, 2004, USENIX ASSOCIATION PROCEEDINGS OF THE SIXTH SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDI '04), P137
  • [8] The Tail at Scale
    Dean, Jeffrey
    Barroso, Luiz Andre
    [J]. COMMUNICATIONS OF THE ACM, 2013, 56 (02) : 74 - 80
  • [9] Challenges in real-time virtualization and predictable cloud computing
    Garcia-Valls, Marisol
    Cucinotta, Tommaso
    Lu, Chenyang
    [J]. JOURNAL OF SYSTEMS ARCHITECTURE, 2014, 60 (09) : 726 - 740
  • [10] Improving Speculative Execution Performance with Coworker for Cloud Computing
    Huang, Sheng-Wei
    Huang, Tzu-Chi
    Lyu, Syue-Ru
    Shieh, Ce-Kuen
    Chou, Yi-Sheng
    [J]. 2011 IEEE 17TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2011, : 1004 - 1009