MilliSort and MilliQuery: Large-Scale Data-Intensive Computing in Milliseconds

Authors
Li, Yilong [1]
Park, Seo Jin [2]
Ousterhout, John [1]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] MIT CSAIL, Cambridge, MA USA
Source
PROCEEDINGS OF THE 18TH USENIX SYMPOSIUM ON NETWORKED SYSTEM DESIGN AND IMPLEMENTATION | 2021
Keywords
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Today's datacenter applications couple scale and time: applications that harness large numbers of servers also execute for long periods of time (seconds or more). This paper explores the possibility of flash bursts: applications that use a large number of servers but for very short time intervals (as little as one millisecond). In order to learn more about the feasibility of flash bursts, we developed two new benchmarks, MilliSort and MilliQuery. MilliSort is a sorting application and MilliQuery implements three SQL queries. The goal for both applications was to process as many records as possible in one millisecond, given unlimited resources in a datacenter. The short time scale required a new distributed sorting algorithm for MilliSort that uses a hierarchical form of partitioning. Both applications depended on fast group communication primitives such as shuffle and all-gather. Our implementation of MilliSort can sort 0.84 million items in one millisecond using 120 servers on an HPC cluster; MilliQuery can process 0.03-48 million items in one millisecond using 60-280 servers, depending on the query. The number of items that each application can process grows quadratically with the time budget. The primary obstacle to scalability is per-message costs, which appear in the form of inefficient shuffles and coordination overhead.
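Illustrative note: the abstract names partitioning, shuffle, and all-gather without showing how they fit together. The sketch below is not MilliSort's algorithm (which the abstract describes as hierarchical). It is a flat, single-level partition sort simulated in one process, assuming hypothetical names (`partition_sort`, `samples_per_server`) and plain Python lists standing in for servers. It only shows the general pattern: sample keys, agree on splitters, shuffle each key to the server that owns its range, then sort locally.

```python
import bisect
import random


def partition_sort(keys, num_servers, samples_per_server=8, seed=0):
    """Simulate a single-level partition-based sort in one process.

    Hypothetical sketch: "servers" are plain lists, and the function
    returns one sorted bucket per server.
    """
    rng = random.Random(seed)
    # Deal the input round-robin onto the simulated servers.
    servers = [keys[i::num_servers] for i in range(num_servers)]

    # Each server contributes a few sampled keys; collecting them in one
    # place stands in for an all-gather of samples.
    samples = []
    for local in servers:
        if local:
            count = min(samples_per_server, len(local))
            samples.extend(rng.sample(local, count))
    samples.sort()

    # Pick num_servers - 1 splitters at evenly spaced sample positions.
    splitters = []
    if samples:
        for i in range(1, num_servers):
            splitters.append(samples[i * len(samples) // num_servers])

    # Shuffle: route every key to the bucket whose key range contains it.
    buckets = [[] for _ in range(num_servers)]
    for local in servers:
        for key in local:
            buckets[bisect.bisect_right(splitters, key)].append(key)

    # Local sort; concatenating the buckets in order yields sorted output.
    for bucket in buckets:
        bucket.sort()
    return buckets
```

In a real deployment the shuffle step sends one message per pair of servers, which is where the per-message costs mentioned in the abstract would show up. This single-process simulation does not model that cost.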
Pages: 593-612
Page count: 20