Data leakage detection in machine learning code: transfer learning, active learning, or low-shot prompting?

Cited by: 0
Authors
Alturayeif, Nouf [1 ,2 ]
Hassine, Jameleddine [1 ,3 ]
Affiliations
[1] King Fahd Univ Petr & Minerals, Informat & Comp Sci Dept, Dhahran, Saudi Arabia
[2] Imam Abdulrahman Bin Faisal Univ, Comp Dept, Dammam, Saudi Arabia
[3] King Fahd Univ Petr & Minerals, Interdisciplinary Res Ctr Intelligent Secure Syst, Dhahran, Saudi Arabia
Keywords
Data leakage; Code quality; Transfer learning; Active learning; Low-shot prompting;
DOI
10.7717/peerj-cs.2730
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the increasing reliance on machine learning (ML) across diverse disciplines, ML code has become subject to a number of quality issues, such as lack of documentation, algorithmic biases, overfitting, lack of reproducibility, inadequate data preprocessing, and potential for data leakage, all of which can significantly affect the performance and reliability of ML models. Data leakage degrades the quality of ML models: information from the test set inadvertently influences the training process, leading to inflated performance metrics that do not generalize well to new, unseen data. Data leakage can occur either at the dataset level (i.e., during dataset construction) or at the code level. Existing studies introduced methods to detect code-level data leakage using manual and code analysis approaches. However, automated tools with advanced ML techniques are increasingly recognized as essential for efficiently identifying quality issues in large and complex codebases, enhancing the overall effectiveness of code review processes. In this article, we explore ML-based approaches for detecting code-level data leakage in ML code when only a limited annotated dataset is available. We propose three approaches, namely, transfer learning, active learning, and low-shot prompting. Additionally, we introduce an automated approach to handle the class imbalance of code data. Our results show that active learning outperformed the other approaches with an F2 score of 0.72 and reduced the number of needed annotated samples from 1,523 to 698. We conclude that existing ML-based approaches can effectively mitigate the challenges associated with limited data availability.
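To make the detection target concrete, the sketch below (illustrative only, not taken from the paper or its tool) shows the most common code-level leakage pattern the abstract alludes to: fitting a preprocessing step on the full dataset before the train/test split, so test-set statistics influence training, alongside the leak-free alternative. All names here are our own; scikit-learn is assumed.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Leaky pattern: the scaler's mean/std are computed over ALL rows,
# including the future test split, before the data are divided.
X_scaled = StandardScaler().fit_transform(X)
X_tr_leaky, X_te_leaky, _, _ = train_test_split(X_scaled, y, random_state=0)

# Leak-free pattern: split first, fit the scaler on the training
# split only, then apply that same transform to the held-out split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr_ok = scaler.transform(X_tr)
X_te_ok = scaler.transform(X_te)
```

A static detector of the kind the paper studies would flag the first pattern (a `fit`/`fit_transform` call that precedes the split) and accept the second.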
Pages: 29