The Case for Domain-Specific Networks

Authors
Abts, Dennis [1]
Kim, John [2]
Affiliations
[1] NVIDIA, Santa Clara, CA 95050 USA
[2] Korea Adv Inst Sci & Technol, Daejeon, South Korea
Source
2023 IEEE SYMPOSIUM ON HIGH-PERFORMANCE INTERCONNECTS, HOTI | 2023
DOI
10.1109/HOTI59126.2023.00021
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
Modern parallel computers are dichotomized into capacity or capability systems. Capacity systems cater to a wide range of weak scaling workloads, using distributed parallel systems with message passing, while capability systems focus on strong scaling workloads across a significant fraction of the machine's processing units. The interconnection network differs under these regimes, with commodity Ethernet or InfiniBand solutions typically deployed for capacity systems, while capability-class systems often necessitate tightly-coupled, fine-grained communication. Systems built for AI training and inference embody traits from both classes: tight coupling and strong scaling for model parallelism, and weak scaling for data parallelism in a distributed system. Handling 100-billion-parameter large language models and trillion-token data sets presents computational challenges for current supercomputing infrastructure. This paper discusses the crucial role of the interconnection network in these large-scale systems, advocating for flexible, low-latency interconnects that can deliver high throughput at large scales with tens of thousands of endpoints. This work also emphasizes the importance of reliability and resilience in enduring long-running training workloads and demanding inference requirements of domain-specific workloads.
Pages: 49-52
Page count: 4