PERFGEN: A Synthesis and Evaluation Framework for Performance Data using Generative AI

Cited by: 0

Authors:

Banday, Banooqa H. [1]

Islam, Tanzima Z. [1]

Marathe, Aniruddha [2]

Affiliations:

[1] Texas State Univ, San Marcos, TX 78666 USA

[2] Lawrence Livermore Natl Lab, Livermore, CA 94550 USA

Source:

2024 IEEE 48TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE, COMPSAC 2024 | 2024

Keywords:

Large Language Model; Generative Modeling; Evaluation; Scientific Data

DOI:

10.1109/COMPSAC61105.2024.00035

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Collecting data in High-Performance Computing (HPC) is a laborious task, demanding that application scientists execute the application multiple times with different configurations. Because performance modeling and root cause analysis are essential initial phases of performance enhancement, the data collection phase prolongs the optimization process. Motivated by this observation, we investigate the feasibility of leveraging recent advances in generative Artificial Intelligence (AI) to synthesize performance samples. However, generating synthetic performance data introduces an additional hurdle: the absence of ground truths to assess the quality of the synthetic data. This work takes a step toward bridging this gap: we propose a framework, PERFGEN, for generating performance data and evaluating its quality using a novel metric called Dissimilarity. Our experiments with three performance and five machine learning datasets (including three classification and two regression datasets) confirm that our proposed Dissimilarity correlates with model accuracy better than three state-of-the-art metrics, namely SD quality, Kullback-Leibler Divergence (KL), and TabSyndex, demonstrating that the Dissimilarity metric strongly correlates with the quality of generated scientific data. We evaluate the quality by measuring how well the generated data enables a downstream Machine Learning (ML) task to generalize. Since performance data is a special case of scientific data (typically stored in tabular format and consisting of numerical, categorical, and ordinal features), our methodologies and metrics apply to scientific data from other domains as well.
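
Illustrative note: the abstract names Kullback-Leibler Divergence (KL) as one of the baseline metrics for comparing synthetic data against real data. The following is a minimal sketch of how such a score might be estimated for a single numerical feature using shared histogram bins. It is not the PERFGEN Dissimilarity metric, which the abstract does not define; the function names, the bin count, the smoothing constant `eps`, and the toy Gaussian samples are all assumptions made for illustration.

```python
import math
import random


def histogram(values, lo, hi, bins):
    """Count values into equal-width bins over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp the index so that v == hi falls in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def kl_divergence(real, synthetic, bins=10, eps=1e-9):
    """Estimate KL(real || synthetic) for one numerical feature."""
    lo = min(min(real), min(synthetic))
    hi = max(max(real), max(synthetic))
    if hi == lo:
        return 0.0  # both samples hold the same constant value
    p_counts = histogram(real, lo, hi, bins)
    q_counts = histogram(synthetic, lo, hi, bins)
    # Additive smoothing (an assumption) avoids log(0) and division by zero.
    p = [(c + eps) / (len(real) + eps * bins) for c in p_counts]
    q = [(c + eps) / (len(synthetic) + eps * bins) for c in q_counts]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy data (hypothetical): one "real" feature and two synthetic candidates.
rng = random.Random(0)
real = [rng.gauss(10.0, 2.0) for _ in range(2000)]
close = [rng.gauss(10.0, 2.0) for _ in range(2000)]  # same distribution
far = [rng.gauss(14.0, 2.0) for _ in range(2000)]    # shifted mean

print("KL(real || close):", kl_divergence(real, close))
print("KL(real || far):  ", kl_divergence(real, far))
```

A lower score means the synthetic feature's distribution is closer to the real one. A per-feature score like this looks at one column's marginal distribution only and does not capture relationships between features.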

References

Pages: 188-197

Page count: 10

Total: 43 entries

[1] Alberti Giovanni S., 2023, Continuous generative neural networks
[2] Alpaydin E., 1998, Optical Recognition of Handwritten Digits, DOI 10.24432/C50P49
[3] [Anonymous], 2023, Synthetic data metrics
[4] Armanious, Karim; Jiang, Chenming; Fischer, Marc; Kuestner, Thomas; Nikolaou, Konstantin; Gatidis, Sergios; Yang, Bin. MedGAN: Medical image translation using GANs [J]. COMPUTERIZED MEDICAL IMAGING AND GRAPHICS, 2020, 79
[5] Asuero, AG; Sayago, A; González, AG. The correlation coefficient: An overview [J]. CRITICAL REVIEWS IN ANALYTICAL CHEMISTRY, 2006, 36 (01): 41-59
[6] Betzalel E, 2022, A study on the evaluation of generative models
[7] Bhattacharyya Arnab, 2022, On approximating total variation distance
[8] Biewald L, 2020, Experiment tracking with weights and biases
[9] Bock R., 2007, MAGIC Gamma Telescope, DOI 10.24432/C52C8B
[10] Borisov Vadim, 2023, 11 INT C LEARN REPR
