A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

Cited by: 30
Authors
Putnam, Andrew [1]
Caulfield, Adrian M. [1]
Chung, Eric S. [1]
Chiou, Derek [1, 2]
Constantinides, Kypros [3]
Demme, John [4]
Esmaeilzadeh, Hadi [5]
Fowers, Jeremy [1]
Gopal, Gopi Prashanth [1]
Gray, Jan [1]
Haselman, Michael [1]
Hauck, Scott [1, 6]
Heil, Stephen [1]
Hormati, Amir [7]
Kim, Joo-Young [1]
Lanka, Sitaram [1]
Larus, James [8]
Peterson, Eric [1]
Pope, Simon [1]
Smith, Aaron [1]
Thong, Jason [1]
Xiao, Phillip Yi [1]
Burger, Doug [1]
Affiliations
[1] Microsoft, Redmond, WA 98052 USA
[2] Univ Texas Austin, Austin, TX 78712 USA
[3] Amazon Web Serv, Boston, MA USA
[4] Columbia Univ, New York, NY USA
[5] Georgia Inst Technol, Atlanta, GA 30332 USA
[6] Univ Washington, Seattle, WA 98195 USA
[7] Google Inc, Mountain View, CA USA
[8] Ecole Polytech Fed Lausanne, Lausanne, Switzerland
DOI
10.1145/2996868
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we designed and built a composable, reconfigurable hardware fabric based on field-programmable gate arrays (FPGAs). Each server in the fabric contains one FPGA, and all FPGAs within a 48-server rack are interconnected over a low-latency, high-bandwidth network. We describe a medium-scale deployment of this fabric on a bed of 1,632 servers and measure its effectiveness in accelerating the ranking component of the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by 95% at a desirable latency distribution, or reduces tail latency by 29% at a fixed throughput. In other words, the reconfigurable fabric enables the same throughput using only half the number of servers.
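The step from "95% higher per-server throughput" to "half the number of servers" can be checked with a little arithmetic. The following is a minimal sketch, not taken from the paper: the function name `servers_needed` is hypothetical, it assumes aggregate throughput scales linearly with server count, and it uses the 1,632-server deployment size purely as an illustrative baseline.

```python
import math


def servers_needed(baseline_servers: int, throughput_gain: float) -> int:
    """Servers required to match the baseline aggregate throughput when
    each server handles (1 + throughput_gain) times its original load.

    Assumes linear scaling across servers (an illustrative simplification).
    """
    return math.ceil(baseline_servers / (1.0 + throughput_gain))


baseline = 1632        # deployment size quoted in the abstract
rack_size = 48         # servers (and FPGAs) per rack, per the abstract
gain = 0.95            # 95% per-server ranking throughput improvement

needed = servers_needed(baseline, gain)
fraction = needed / baseline
racks = baseline // rack_size

print(f"racks in deployment: {racks}")
print(f"servers needed for equal throughput: {needed} ({fraction:.1%} of baseline)")
```

Under these assumptions the accelerated fleet needs roughly 51% of the original servers, which is consistent with the abstract's "only half" summary.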
Pages: 114-122
Page count: 9