Sparrow: Distributed, Low Latency Scheduling

Cited by: 323
Authors
Ousterhout, Kay [1]
Wendell, Patrick [1]
Zaharia, Matei [1]
Stoica, Ion [1]
Affiliation
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
Source
SOSP'13: PROCEEDINGS OF THE TWENTY-FOURTH ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES | 2013
DOI
10.1145/2517349.2522716
Chinese Library Classification number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Large-scale data analytics frameworks are shifting towards shorter task durations and larger degrees of parallelism to provide low latency. Scheduling highly parallel jobs that complete in hundreds of milliseconds poses a major challenge for task schedulers, which will need to schedule millions of tasks per second on appropriate machines while offering millisecond-level latency and high availability. We demonstrate that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design. We implement and deploy our scheduler, Sparrow, on a 110-machine cluster and demonstrate that Sparrow performs within 12% of an ideal scheduler.
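
The abstract names a decentralized, randomized sampling approach but does not spell out the mechanism. As a rough illustration only, the sketch below shows generic "power of d choices" placement: a scheduler probes a few randomly chosen workers and sends the task to the one with the shortest queue. This is an assumed stand-in, not Sparrow's actual algorithm; the names `place_task` and `simulate`, the queue-length load model, and all parameter values are hypothetical.

```python
import random


def place_task(queue_lengths, d, rng):
    """Probe d distinct workers at random and return the index of the
    least-loaded one (hypothetical helper, not from the paper)."""
    probes = rng.sample(range(len(queue_lengths)), d)
    return min(probes, key=lambda w: queue_lengths[w])


def simulate(num_workers=100, num_tasks=10_000, d=2, seed=0):
    """Place tasks one at a time and return the final queue lengths.

    Tasks never finish in this toy model, so the result only shows how
    evenly each sampling width d spreads the load.
    """
    rng = random.Random(seed)
    queues = [0] * num_workers
    for _ in range(num_tasks):
        queues[place_task(queues, d, rng)] += 1
    return queues


if __name__ == "__main__":
    for d in (1, 2):
        print(f"d={d}: longest queue = {max(simulate(d=d))}")
```

With `d=1` the placement is purely random; with `d=2` each decision uses only two probes and no global state, which is what lets many schedulers run independently in this kind of design.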
Pages: 69-84
Page count: 16