Supercomputer 3D Digital Twin for User Focused Real-Time Monitoring

Cited by: 0
Authors
Bergeron, Bill [1]
Hubbell, Matthew [1]
Mojica, Daniel [1]
Reuther, Albert [1]
Arcand, William [1]
Bestor, David [1]
Burrill, Daniel [1]
Byun, Chansup [1]
Gadepally, Vijay [1]
Houle, Michael [1]
Jananthan, Hayden [1]
Jones, Michael [1]
Luszczek, Piotr [1]
Michaleas, Peter [1]
Milechin, Lauren [1]
Mullen, Julie [1]
Prout, Andrew [1]
Rosa, Antonio [1]
Yee, Charles [1]
Kepner, Jeremy [1]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
Source
2024 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE, HPEC | 2024
Keywords
Supercomputing; High Performance Computing; HPC; Digital Twin; 3D Gaming; Gaming Engine; Unity; Supercloud; cloud computing; SYSTEM;
DOI
10.1109/HPEC62836.2024.10938489
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Real-time supercomputing performance analysis is a critical aspect of evaluating and optimizing computational systems in a dynamic user environment. The operation of supercomputers produces vast quantities of analytic data from multiple sources and of varying types, so compiling this data in an efficient manner is critical to the process. The MIT Lincoln Laboratory Supercomputing Center has been utilizing the Unity 3D game engine to create a Digital Twin of our supercomputing systems for several years to perform system monitoring. Unity offers robust visualization capabilities, making it ideal for creating a sophisticated representation of the computational processes. As we scale the systems to include a diversity of resources such as accelerators and the addition of more users, we need to implement new analysis tools for the monitoring system. The workloads in research continuously change, as does the capability of Unity, and this allows us to adapt our monitoring tools to scale and incorporate features enabling efficient replay of system-wide events, user isolation, and machine-level granularity. Our system takes full advantage of the modern capabilities of the Unity Engine in a way that intuitively represents the real-time workload performed on a supercomputer. It allows HPC system engineers to quickly diagnose usage-related errors with its responsive user interface, which scales efficiently with large data sets.
Pages: 8