Contrastive Code-Comment Pre-training

Times Cited: 0
Authors
Pei, Xiaohuan [1 ]
Liu, Daochang [1 ]
Qian, Luo [1 ]
Xu, Chang [1 ]
Affiliations
[1] Univ Sydney, Sch Comp Sci, Fac Engn, Sydney, Australia
Source
2022 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM) | 2022
Funding
Australian Research Council;
Keywords
contrastive learning; representation learning; pre-training; programming language processing;
DOI
10.1109/ICDM54844.2022.00050
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Pre-trained models for Natural Languages (NL) have recently been shown to transfer well to Programming Languages (PL) and to largely benefit various intelligent code-related tasks, such as code search, clone detection, program translation, and code document generation. However, existing pre-training methods for programming languages mainly rely on masked language modeling and next-sentence prediction at the token or graph level. This restricted form limits their performance and transferability, since PL and NL have different syntax rules and the downstream tasks require a multi-modal representation. Here we introduce C3P, a Contrastive Code-Comment Pre-training approach, which solves various downstream tasks by pre-training multi-representation features on both programming and natural syntax. The model encodes the code syntax and its natural language description (comment) with two encoders, and the encoded embeddings are projected into a multi-modal space to learn the latent representation. In this latent space, C3P jointly trains the code and comment encoders with a symmetric loss function that maximizes the cosine similarity of correct code-comment pairs while minimizing the similarity of unrelated pairs. We verify the empirical performance of the proposed pre-trained models on multiple downstream code-related tasks. Comprehensive experiments demonstrate that C3P outperforms previous work on the understanding tasks of code search and code clone detection, as well as on the generation tasks of program translation and document generation. Furthermore, we validate the transferability of C3P to new programming languages not seen during pre-training. The results show that our model surpasses all supervised methods and, for some programming languages, even outperforms prior pre-trained approaches. Code is available at https://github.com/TerryPei/C3P.
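For illustration, the symmetric code-comment loss described in the abstract can be sketched in PyTorch in the style of CLIP's contrastive objective. This is a minimal sketch, not the authors' released implementation: the function name, the temperature value, and the batch-index pairing convention are assumptions here; the repository at https://github.com/TerryPei/C3P is authoritative.

```python
import torch
import torch.nn.functional as F


def symmetric_contrastive_loss(code_emb: torch.Tensor,
                               comment_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric loss over a batch of code-comment pairs.

    code_emb, comment_emb: (batch, dim) outputs of the two encoders,
    already projected into the shared multi-modal space.
    """
    # L2-normalize so that dot products equal cosine similarities.
    code_emb = F.normalize(code_emb, dim=-1)
    comment_emb = F.normalize(comment_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds correct pairs,
    # off-diagonal entries are the unrelated pairs to be pushed apart.
    logits = code_emb @ comment_emb.t() / temperature

    # Each code snippet matches the comment at the same batch index.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: code->comment and comment->code.
    loss_c2n = F.cross_entropy(logits, targets)
    loss_n2c = F.cross_entropy(logits.t(), targets)
    return (loss_c2n + loss_n2c) / 2
```

Minimizing the cross-entropy over the similarity matrix simultaneously maximizes the diagonal (correct pairs) and suppresses the off-diagonal entries (unrelated pairs), matching the objective the abstract describes.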
Pages: 398-407
Page count: 10