Lightwave Fabrics: At-Scale Optical Circuit Switching for Datacenter and Machine Learning Systems

Cited by: 32
Authors
Liu, Hong [1]
Urata, Ryohei [1]
Yasumura, Kevin [1]
Zhou, Xiang [1]
Bannon, Roy [1]
Berger, Jill [1]
Dashti, Pedram [1]
Jouppi, Norm [1]
Lam, Cedric [1]
Li, Sheng [1]
Mao, Erji [1]
Nelson, Daniel [1]
Papen, George [1]
Tariq, Mukarram [1]
Vahdat, Amin [1]
Affiliations
[1] Google, Mountain View, CA 94043 USA
Source
Proceedings of the ACM SIGCOMM 2023 Conference (SIGCOMM 2023) | 2023
Keywords
Data center networks; Optical circuit switches; Machine learning
DOI
10.1145/3603269.3604836
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
We describe our experience developing what we believe to be the world's first large-scale production deployments of lightwave fabrics used for both datacenter networking and machine-learning (ML) applications. Using optical circuit switches (OCSes) and optical transceivers developed in-house, we employ hardware and software codesign to integrate the fabrics into our network and computing infrastructure. Key to our design is a high degree of multiplexing enabled by new kinds of wavelength-division-multiplexing (WDM) and optical circulators that support high-bandwidth bidirectional traffic on a single strand of optical fiber. The development of the requisite OCS and optical transceiver technologies leads to a synchronous lightwave fabric that is reconfigurable, low latency, rate agnostic, and highly available. These fabrics have provided substantial benefits for long-lived traffic patterns in our datacenter networks and predictable traffic patterns in tightly-coupled machine learning clusters. We report results for a large-scale ML superpod with 4096 tensor processing unit (TPU) V4 chips that has more than one ExaFLOP of computing power. For this use case, the deployment of a lightwave fabric provides up to 3x better system availability and model-dependent performance improvements of up to 3.3x compared to a static fabric, despite constituting less than 6% of the total system cost.
Pages: 499-515
Page count: 17