Switches for HIRE: Resource Scheduling for Data Center In-Network Computing

Cited by: 22
Authors
Bloecher, Marcel [1]
Wang, Lin [1, 2]
Eugster, Patrick [3, 4]
Schmidt, Max [1]
Affiliations
[1] Tech Univ Darmstadt, Darmstadt, Germany
[2] Vrije Univ Amsterdam, Amsterdam, Netherlands
[3] USI Lugano, Lugano, Switzerland
[4] Purdue Univ, W Lafayette, IN 47907 USA
Source
ASPLOS XXVI: TWENTY-SIXTH INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS | 2021
Funding
European Research Council; Swiss National Science Foundation; US National Science Foundation
Keywords
data center; scheduling; in-network computing; heterogeneity; nonlinear resource usage
DOI
10.1145/3445814.3446760
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The recent trend towards more programmable switching hardware in data centers opens up new possibilities for distributed applications to leverage in-network computing (INC). Literature so far has largely focused on individual application scenarios of INC, leaving aside the problem of coordinating usage of potentially scarce and heterogeneous switch resources among multiple INC scenarios, applications, and users. The traditional model of resource pools of isolated compute containers does not fit an INC-enabled data center. This paper describes HIRE, a Holistic INC-aware Resource managEr which allows for server-local and INC resources to be coordinated in a unified manner. HIRE introduces a novel flexible resource (meta-)model to address heterogeneity, resource interchangeability, and non-linear resource requirements, and integrates dependencies between resources and locations in a unified cost model, cast as a min-cost max-flow problem. In the absence of prior work, we compare HIRE against variants of state-of-the-art schedulers retrofitted to handle INC requests. Experiments with a workload trace of a 4000-machine cluster show that HIRE makes better use of INC resources by serving 8-30% more INC requests, while at the same time reducing network detours by 20%, and reducing tail placement latency by 50%.
Pages: 268-285
Page count: 18