MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation

Cited by: 8
Authors
Dong, Zeming [1 ]
Hu, Qiang [2 ]
Guo, Yuejun [3 ]
Cordy, Maxime [2 ]
Papadakis, Mike [2 ]
Zhang, Zhenya [1 ]
Le Traon, Yves [2 ]
Zhao, Jianjun [1 ]
Affiliations
[1] Kyushu Univ, Fukuoka, Japan
[2] Univ Luxembourg, Luxembourg, Luxembourg
[3] Luxembourg Inst Sci & Technol, Luxembourg, Luxembourg
Source
2023 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ANALYSIS, EVOLUTION AND REENGINEERING, SANER | 2023
Keywords
Data augmentation; Mixup; Source code analysis
DOI
10.1109/SANER56733.2023.00043
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Inspired by the great success of Deep Neural Networks (DNNs) in natural language processing (NLP), DNNs have been increasingly applied to source code analysis and have attracted significant attention from the software engineering community. Due to their data-driven nature, DNN models require massive, high-quality labeled training data to achieve expert-level performance. Collecting such data is often not hard, but labeling it is notoriously laborious, and DNN-based code analysis worsens the situation further because labeling source code also demands sophisticated expertise. Data augmentation has been a popular approach to supplement training data in domains such as computer vision and NLP. However, existing data augmentation approaches in code analysis rely on simple methods, such as data transformation and adversarial example generation, and thus bring only limited performance gains. In this paper, we propose MIXCODE, a data augmentation approach that aims to effectively supplement valid training data, inspired by Mixup, a recent advance in computer vision. Specifically, we first apply multiple code refactoring methods to generate transformed code whose labels are consistent with those of the original data. We then adapt the Mixup technique to mix the original code with the transformed code, thereby augmenting the training data. We evaluate MIXCODE on two programming languages (Java and Python), two code tasks (problem classification and bug detection), four benchmark datasets (JAVA250, Python800, CodRep1, and Refactory), and seven model architectures (including the two pre-trained models CodeBERT and GraphCodeBERT). Experimental results demonstrate that MIXCODE outperforms the baseline data augmentation approach by up to 6.24% in accuracy and 26.06% in robustness.
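The core Mixup step the abstract describes is a linear interpolation between two training inputs and their labels. The following is a minimal, hypothetical PyTorch sketch of that step applied to code embeddings of an original program and a label-preserving refactored variant; the function name, tensor shapes, and the Beta-distribution hyperparameter alpha are illustrative assumptions, not the authors' implementation.

    import torch

    def mixup_code_embeddings(x_orig, x_refactored, y_orig, y_refactored, alpha=0.2):
        """Hypothetical sketch of the Mixup step described in the abstract.

        x_orig, x_refactored: token-embedding tensors of shape
            (batch, seq_len, dim) for an original program and a
            refactored variant produced by a label-preserving transformation.
        y_orig, y_refactored: one-hot label tensors of shape
            (batch, num_classes); identical when both inputs come from
            the same sample, since refactoring preserves the label.
        """
        # Standard Mixup draws the mixing coefficient from Beta(alpha, alpha).
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        # Interpolate inputs and labels with the same coefficient.
        x_mixed = lam * x_orig + (1.0 - lam) * x_refactored
        y_mixed = lam * y_orig + (1.0 - lam) * y_refactored
        return x_mixed, y_mixed

When the two inputs are an original/refactored pair of the same sample, y_mixed reduces to the original label, so the augmented example is a genuinely new input with an unchanged target.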
Pages: 379-390
Page count: 12