Contrastive Code-Comment Pre-training

Times Cited: 0
Authors
Pei, Xiaohuan [1 ]
Liu, Daochang [1 ]
Qian, Luo [1 ]
Xu, Chang [1 ]
Affiliations
[1] Univ Sydney, Sch Comp Sci, Fac Engn, Sydney, Australia
Source
2022 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM) | 2022
Funding
Australian Research Council;
Keywords
contrastive learning; representation learning; pre-training; programming language processing;
DOI
10.1109/ICDM54844.2022.00050
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Pre-trained models for Natural Languages (NL) have recently been shown to transfer well to Programming Languages (PL) and to largely benefit various intelligent code-related tasks, such as code search, clone detection, program translation, and code documentation generation. However, existing pre-training methods for programming languages mainly rely on masked language modeling and next-sentence prediction at the token or graph level. This restricted form limits their performance and transferability, since PL and NL have different syntax rules and the downstream tasks require a multi-modal representation. Here we introduce C3P, a Contrastive Code-Comment Pre-training approach, which solves various downstream tasks by pre-training multi-representation features on both programming and natural syntax. The model encodes the code syntax and its natural language description (comment) with two encoders, and the encoded embeddings are projected into a multi-modal space for learning the latent representation. In the latent space, C3P jointly trains the code and comment encoders with a symmetric loss function, which aims to maximize the cosine similarity of the correct code-comment pairs while minimizing the similarity of unrelated pairs. We verify the empirical performance of the proposed pre-trained models on multiple downstream code-related tasks. The comprehensive experiments demonstrate that C3P outperforms previous work on the understanding tasks of code search and clone detection, as well as the generation tasks of program translation and documentation generation. Furthermore, we validate the transferability of C3P to new programming languages not seen in the pre-training stage. The results show our model surpasses all supervised methods and in some programming languages even outperforms prior pre-trained approaches. Code is available at https://github.com/TerryPei/C3P.
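The symmetric loss the abstract describes is the standard CLIP-style bidirectional contrastive objective: normalize both batches of projected embeddings, compute the batch-wise cosine-similarity matrix, and apply cross-entropy in both directions with matching pairs on the diagonal. The following is a minimal PyTorch sketch of that objective; the function name, temperature value, and batch framing are illustrative assumptions, not taken from the authors' released code.

    import torch
    import torch.nn.functional as F

    def symmetric_code_comment_loss(code_emb, comment_emb, temperature=0.07):
        # Hypothetical sketch of a CLIP-style symmetric contrastive loss.
        # Row i of code_emb and comment_emb is assumed to be a matching pair.
        # L2-normalize so dot products equal cosine similarities.
        code_emb = F.normalize(code_emb, dim=-1)
        comment_emb = F.normalize(comment_emb, dim=-1)
        # Batch-wise cosine-similarity logits, scaled by a temperature.
        logits = code_emb @ comment_emb.t() / temperature
        # Matching code-comment pairs lie on the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)
        # Cross-entropy in both directions: code->comment and comment->code.
        loss_c2n = F.cross_entropy(logits, targets)
        loss_n2c = F.cross_entropy(logits.t(), targets)
        return (loss_c2n + loss_n2c) / 2

Under this formulation, each code snippet in the batch treats its paired comment as the positive and every other comment as a negative, and vice versa, which matches the abstract's description of maximizing similarity for correct pairs while minimizing it for unrelated ones.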
Pages: 398 - 407
Page Count: 10