CONCORD: Clone-Aware Contrastive Learning for Source Code

Cited by: 2
Authors
Ding, Yangruibo [1 ]
Chakraborty, Saikat [2 ]
Buratti, Luca [3 ]
Pujar, Saurabh [3 ]
Morari, Alessandro [3 ]
Kaiser, Gail [1 ]
Ray, Baishakhi [1 ]
Affiliations
[1] Columbia Univ, New York, NY 10027 USA
[2] Microsoft Res, Redmond, WA USA
[3] IBM Res, Yorktown Hts, NY USA
Source
PROCEEDINGS OF THE 32ND ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON SOFTWARE TESTING AND ANALYSIS, ISSTA 2023 | 2023
Keywords
Source Code Pre-training; Code Clone; Bug Detection; Neural Networks
DOI
10.1145/3597926.3598035
CLC Number (Chinese Library Classification)
TP31 [Computer Software]
Subject Classification Codes
081202; 0835
Abstract
Deep Learning (DL) models for analyzing source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks, such as clone and bug detection. While previous work successfully learned from different code abstractions (e.g., token, AST, graph), we argue that it is also essential to factor in how developers code day-to-day when learning general-purpose representations. On the one hand, human developers tend to write repetitive programs referencing existing code snippets from the current codebase or online resources (e.g., Stack Overflow) rather than implementing functions from scratch; such behaviors result in a vast number of code clones. On the other hand, a clone that deviates by mistake might trigger malicious program behaviors. Thus, as a proxy to incorporate developers' coding behavior into the pre-training scheme, we propose to include code clones and their deviants. In particular, we propose CONCORD, a self-supervised pre-training strategy that places benign clones closer in the representation space while moving deviants further apart. We show that CONCORD's clone-aware pre-training drastically reduces the need for expensive pre-training resources while improving the performance of downstream SE tasks. We also empirically demonstrate that CONCORD can improve existing pre-trained models to learn better representations that consequently become more effective at both identifying semantically equivalent programs and differentiating buggy from non-buggy code.
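The clone-aware objective described in the abstract can be illustrated with a standard InfoNCE-style contrastive loss. The following is a minimal PyTorch sketch, not CONCORD's exact objective: the function name, temperature value, and batching scheme are assumptions; it only shows the core idea of pulling an anchor's embedding toward its benign clone while pushing it away from a buggy deviant.

```python
import torch
import torch.nn.functional as F

def clone_aware_contrastive_loss(anchor, clone_pos, deviant_neg, temperature=0.1):
    """Illustrative sketch: anchor, clone_pos, and deviant_neg are (B, D)
    embeddings from a code encoder for an original program, its benign
    clone, and its buggy deviant, respectively (names are hypothetical)."""
    # Normalize embeddings so dot products become cosine similarities.
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(clone_pos, dim=-1)
    n = F.normalize(deviant_neg, dim=-1)

    # Similarity of each anchor to every clone in the batch; the diagonal
    # holds the true (benign) clone, off-diagonals act as in-batch negatives.
    sim_clones = a @ p.t() / temperature                           # (B, B)
    # Similarity to the paired deviant, appended as an extra hard negative.
    sim_deviant = (a * n).sum(dim=-1, keepdim=True) / temperature  # (B, 1)

    logits = torch.cat([sim_clones, sim_deviant], dim=1)          # (B, B+1)
    labels = torch.arange(a.size(0), device=a.device)             # diagonal index
    return F.cross_entropy(logits, labels)
```

In a pre-training loop, the three inputs would come from encoding an original function, a semantically equivalent clone, and a deviant variant of it; minimizing this loss drives benign clones together and deviants apart in the representation space, as the abstract describes.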
Pages: 26-38
Number of pages: 13