TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation

Cited by: 2
Authors
Xian, Zixiang [1 ]
Huang, Rubing [1 ]
Towey, Dave [2 ]
Fang, Chunrong [3 ]
Chen, Zhenyu [3 ]
Affiliations
[1] Macau Univ Sci & Technol, Sch Comp Sci & Engn, Macau 999078, Peoples R China
[2] Univ Nottingham Ningbo China, Sch Comp Sci, Ningbo 315100, Zhejiang, Peoples R China
[3] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210093, Peoples R China
Keywords
Codes; Task analysis; Self-supervised learning; Syntactics; Semantics; Vectors; Training; Code embedding; transformer; abstract syntax tree; contrastive learning; NETWORKS
DOI
10.1109/TSE.2024.3393419
CLC number
TP31 [Computer software]
Discipline codes
081202; 0835
Abstract
Artificial intelligence (AI) has revolutionized software engineering (SE) by enhancing software development efficiency. The advent of pre-trained models (PTMs) leveraging transfer learning has significantly advanced AI for SE. However, existing PTMs that operate on individual code tokens suffer from several limitations: they are costly to train and fine-tune, and they rely heavily on labeled data for fine-tuning on task-specific datasets. In this paper, we present TransformCode, a novel framework that learns code embeddings through contrastive learning. Our framework is encoder-agnostic and language-agnostic: it can leverage any encoder model and handle any programming language. We also propose a novel data-augmentation technique called abstract syntax tree (AST) transformation, which applies syntactic and semantic transformations to the original code snippets to generate more diverse and robust samples for contrastive learning. Our framework has several advantages over existing methods: (1) it is flexible and adaptable, because it can easily be extended to other downstream tasks that require code representation (such as code-clone detection and classification); (2) it is efficient and scalable, because it does not require a large model or a large amount of training data, and it can support any programming language; (3) it is not limited to unsupervised learning, but can also be applied to some supervised tasks by incorporating task-specific labels or objectives; and (4) it can adjust the number of encoder parameters based on available computing resources. We evaluate our framework on several code-related tasks, and demonstrate its effectiveness and superiority over state-of-the-art methods such as SourcererCC, Code2vec, and InferCode.
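The AST-transformation augmentation described in the abstract can be illustrated with a minimal sketch using Python's built-in `ast` module. The `VariableRenamer` class and `ast_transform` helper below are hypothetical names chosen for illustration, and the paper's actual transformation set is richer than this single semantics-preserving rewrite; this only shows the general idea of producing a positive pair for contrastive learning.

```python
import ast


class VariableRenamer(ast.NodeTransformer):
    """Rewrite every identifier to a canonical placeholder (var_0, var_1, ...).

    Renaming is semantics-preserving: the transformed snippet computes the
    same result but differs textually, so (original, transformed) can serve
    as a positive pair for contrastive training.
    """

    def __init__(self):
        self.mapping = {}  # original name -> canonical name

    def _canonical(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"var_{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        # NOTE: a production version would leave builtins and globals
        # untouched; this sketch renames every Name node it sees.
        node.id = self._canonical(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._canonical(node.arg)
        return node


def ast_transform(source: str) -> str:
    """Parse, rewrite, and unparse a code snippet (requires Python 3.9+)."""
    return ast.unparse(VariableRenamer().visit(ast.parse(source)))


original = "def add(x, y):\n    total = x + y\n    return total"
augmented = ast_transform(original)
print(augmented)
```

Because the augmented snippet is behaviorally identical to the original, an encoder trained contrastively is pushed to embed both near each other, i.e., to capture program semantics rather than surface token identity.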
Pages: 1600-1619
Number of pages: 20