A Comprehensive Cloud Architecture for Machine Learning-enabled Research

Cited by: 0
Authors
Stubbs, Joe [1]
Freeman, Nathan [1]
Indrakusuma, Dhanny [1]
Garcia, Christian [1]
Halbach, Francois [1]
Hammock, Cody [1]
Curbelo, Gilbert [1]
Jamthe, Anagha [1]
Packard, Mike [1]
Fields, Alex [1]
Affiliations
[1] Texas Adv Comp Ctr, Austin, TX 78758 USA
Source
Practice and Experience in Advanced Research Computing 2024 (PEARC 2024) | 2024
Funding
National Science Foundation (USA)
Keywords
GPUs; Cloud Computing; Machine Learning;
DOI
10.1145/3626203.3670525
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The success of machine learning (ML) algorithms, and deep learning in particular, is having a transformative impact on a wide range of research disciplines, from astronomy, materials science, and climate change to bioinformatics, computational health, and animal ecology. At the same time, these new techniques introduce computational modalities that create challenges for academic computing centers and resource providers that have historically focused on asynchronous, batch-computing paradigms. In particular, there is an emergent need for computing models that enable efficient use of specialized hardware such as graphics processing units (GPUs) in the presence of interactive workloads. In this paper, we present a comprehensive, cloud-based architecture composed of open-source software layers to better meet the needs of modern ML processes and workloads. This framework, deployed at the Texas Advanced Computing Center and in use by various research teams, provides different interfaces at varying levels of abstraction to support and simplify the tasks of users with different backgrounds and expertise, and to efficiently leverage limited GPU resources for these tasks. We present techniques and implementation details for overcoming the challenges of developing and maintaining such an infrastructure; these will be of interest to service providers and infrastructure developers alike.
Pages: 8