A Comprehensive Cloud Architecture for Machine Learning-enabled Research

Cited by: 0
Authors
Stubbs, Joe [1]
Freeman, Nathan [1]
Indrakusuma, Dhanny [1]
Garcia, Christian [1]
Halbach, Francois [1]
Hammock, Cody [1]
Curbelo, Gilbert [1]
Jamthe, Anagha [1]
Packard, Mike [1]
Fields, Alex [1]
Affiliations
[1] Texas Adv Comp Ctr, Austin, TX 78758 USA
Source
PRACTICE AND EXPERIENCE IN ADVANCED RESEARCH COMPUTING 2024, PEARC 2024 | 2024
Funding
National Science Foundation (USA)
Keywords
GPUs; Cloud Computing; Machine Learning;
DOI
10.1145/3626203.3670525
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The success of machine learning (ML) algorithms, and deep learning in particular, is having a transformative impact on a wide range of research disciplines, from astronomy, materials science, and climate change to bioinformatics, computational health, and animal ecology. At the same time, these new techniques introduce computational modalities that create challenges for academic computing centers and resource providers that have historically focused on asynchronous, batch-computing paradigms. In particular, there is an emergent need for computing models that enable efficient use of specialized hardware such as graphics processing units (GPUs) in the presence of interactive workloads. In this paper, we present a comprehensive, cloud-based architecture composed of open-source software layers to better meet the needs of modern ML processes and workloads. This framework, deployed at the Texas Advanced Computing Center and in use by various research teams, provides different interfaces at varying levels of abstraction to support and simplify the tasks of users with different backgrounds and expertise, and to efficiently leverage limited GPU resources for these tasks. We also present techniques and implementation details for overcoming the challenges of developing and maintaining such an infrastructure; these will be of interest to service providers and infrastructure developers alike.
Pages: 8