Contrastive Code-Comment Pre-training

Times Cited: 0
Authors
Pei, Xiaohuan [1 ]
Liu, Daochang [1 ]
Qian, Luo [1 ]
Xu, Chang [1 ]
Affiliations
[1] Univ Sydney, Sch Comp Sci, Fac Engn, Sydney, Australia
Source
2022 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM) | 2022
Funding
Australian Research Council;
Keywords
contrastive learning; representation learning; pre-training; programming language processing;
DOI
10.1109/ICDM54844.2022.00050
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Pre-trained models for Natural Languages (NL) have recently been shown to transfer well to Programming Languages (PL) and to largely benefit various intelligent code-related tasks, such as code search, clone detection, program translation, and code documentation generation. However, existing pre-training methods for programming languages mainly rely on masked language modeling and next-sentence prediction at the token or graph level. This restricted form limits their performance and transferability, since PL and NL have different syntax rules and the downstream tasks require a multi-modal representation. Here we introduce C3P, a Contrastive Code-Comment Pre-training approach, which solves various downstream tasks by pre-training multi-representation features on both programming and natural syntax. The model encodes the code syntax and its natural language description (comment) with two encoders, and the encoded embeddings are projected into a multi-modal space for learning the latent representation. In the latent space, C3P jointly trains the code and comment encoders with a symmetric loss function, which aims to maximize the cosine similarity of the correct code-comment pairs while minimizing the similarity of unrelated pairs. We verify the empirical performance of the proposed pre-trained models on multiple downstream code-related tasks. The comprehensive experiments demonstrate that C3P outperforms previous work on the understanding tasks of code search and clone detection, as well as the generation tasks of program translation and documentation generation. Furthermore, we validate the transferability of C3P to new programming languages not seen in the pre-training stage. The results show our model surpasses all supervised methods and in some programming languages even outperforms prior pre-trained approaches. Code is available at https://github.com/TerryPei/C3P.
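The symmetric loss the abstract describes is the standard CLIP-style bidirectional contrastive objective: normalize both batches of projected embeddings, compute the batch-wise cosine-similarity matrix, and apply cross-entropy in both directions with matching pairs on the diagonal. The following is a minimal PyTorch sketch of that objective; the function name, temperature value, and batch framing are illustrative assumptions, not taken from the authors' released code.

    import torch
    import torch.nn.functional as F

    def symmetric_code_comment_loss(code_emb, comment_emb, temperature=0.07):
        # Hypothetical sketch of a CLIP-style symmetric contrastive loss.
        # Row i of code_emb and comment_emb is assumed to be a matching pair.
        # L2-normalize so dot products equal cosine similarities.
        code_emb = F.normalize(code_emb, dim=-1)
        comment_emb = F.normalize(comment_emb, dim=-1)
        # Batch-wise cosine-similarity logits, scaled by a temperature.
        logits = code_emb @ comment_emb.t() / temperature
        # Matching code-comment pairs lie on the diagonal.
        targets = torch.arange(logits.size(0), device=logits.device)
        # Cross-entropy in both directions: code->comment and comment->code.
        loss_c2n = F.cross_entropy(logits, targets)
        loss_n2c = F.cross_entropy(logits.t(), targets)
        return (loss_c2n + loss_n2c) / 2

Under this formulation, each code snippet in the batch treats its paired comment as the positive and every other comment as a negative, and vice versa, which matches the abstract's description of maximizing similarity for correct pairs while minimizing it for unrelated ones.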
Pages: 398 - 407
Page Count: 10