Turbo: Efficient Communication Framework for Large-scale Data Processing Cluster

Cited by: 0
Authors
Jia, Xuya [1]
Yao, Zhiyi [1,2]
Peng, Chao [1,2]
Zhao, Zihao [3]
Lei, Bin [3]
Liu, Edison [3]
Li, Xiang [1]
He, Zekun [1]
Wang, Yachen [1]
Zou, Xianneng [1]
Zhao, Chongqing [1]
Chu, Jinhui [1]
Wang, Jilong [4]
Miao, Congcong [1]
Affiliations
[1] Tencent, Shenzhen, China
[2] Fudan University, Shanghai, China
[3] NVIDIA, Santa Clara, CA, USA
[4] Tsinghua University, Beijing, China
Source
Proceedings of the ACM SIGCOMM 2024 Conference (ACM SIGCOMM 2024), 2024
Keywords
Remote Direct Memory Access; Load balance; Communication middleware; Reliability; RDMA
DOI
10.1145/3651890.3672241
CLC classification number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Big data processing clusters suffer from long job completion times due to inefficient utilization of RDMA capabilities. Our production measurements in a large-scale cluster with hundreds of server nodes processing large-scale jobs show that the existing RDMA deployment results in a long tail of job completion times, with some jobs taking more than twice the average time to complete. In this paper, we present the design and implementation of Turbo, an efficient communication framework that enables large-scale data processing clusters to achieve high performance and scalability. The core of Turbo's approach is to leverage a dynamic block-level flowlet transmission mechanism and a non-blocking communication middleware to improve network throughput and enhance the system's scalability. Furthermore, Turbo ensures high system reliability by utilizing an external shuffle service and falling back to TCP as a backup. We integrate Turbo into Apache Spark and evaluate it on a small-scale testbed and on a large-scale cluster consisting of hundreds of server nodes. The small-scale testbed results show that Turbo improves network throughput by 15.1% while maintaining high system reliability. The large-scale production results show that Turbo reduces job completion time by 23.9% and increases the job completion rate by 2.03x over existing RDMA solutions.
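To make the mechanism named in the abstract more concrete, below is a minimal illustrative sketch (in Scala, since Turbo is integrated into Apache Spark) of what block-level flowlet transmission with non-blocking sends could look like. All names here (Flowlet, PathSender, FlowletScheduler, the gap threshold) are hypothetical and are not taken from Turbo's implementation or API; the sketch only shows the general idea of splitting a shuffle block into flowlets and re-selecting the transmission path when an idle gap makes cross-path reordering unlikely.

    // Hypothetical sketch, not Turbo's actual API.
    final case class Flowlet(blockId: Long, offset: Int, payload: Array[Byte])

    // One transmission path (e.g., one RDMA queue pair / ECMP path).
    // sendAsync is assumed to be non-blocking, e.g. it only posts a work request.
    trait PathSender {
      def sendAsync(f: Flowlet): Unit
    }

    // Splits a shuffle block into flowlets and re-selects the path whenever the
    // idle gap since the previous send exceeds a threshold, so that packets of
    // different flowlets are unlikely to arrive out of order across paths.
    class FlowletScheduler(paths: IndexedSeq[PathSender],
                           flowletBytes: Int,
                           gapThresholdNanos: Long) {
      private var currentPath = 0
      private var lastSendNanos = 0L

      def sendBlock(blockId: Long, block: Array[Byte]): Unit = {
        var offset = 0
        while (offset < block.length) {
          val now = System.nanoTime()
          if (now - lastSendNanos > gapThresholdNanos)
            currentPath = (currentPath + 1) % paths.length // flowlet boundary: pick a new path
          val len = math.min(flowletBytes, block.length - offset)
          val chunk = java.util.Arrays.copyOfRange(block, offset, offset + len)
          paths(currentPath).sendAsync(Flowlet(blockId, offset, chunk))
          lastSendNanos = now
          offset += len
        }
      }
    }

Under this reading, a receiver would reassemble flowlets by (blockId, offset), and the external shuffle service and TCP fallback mentioned in the abstract could plausibly sit behind the same sender interface as alternative backends; again, this is only an interpretation of the abstract, not a description of the paper's design.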
Pages: 540-553
Number of pages: 14