Cross-Modal Contrastive Learning for Code Search

Cited by: 7
Authors
Shi, Zejian [1]
Xiong, Yun [1,2]
Zhang, Xiaolong [1]
Zhang, Yao [1]
Li, Shanshan [3]
Zhu, Yangyong [1]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Data Sci, Shanghai, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Natl Univ Def Technol, Sch Comp, Changsha, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE AND EVOLUTION (ICSME 2022) | 2022
Funding
National Natural Science Foundation of China
Keywords
code search; code representation; data augmentation; contrastive learning
DOI
10.1109/ICSME55016.2022.00017
CLC Number
TP31 [Computer Software]
Discipline Codes
081202; 0835
Abstract
Code search aims to retrieve code snippets from natural language queries and serves as a core technology for improving development efficiency. Previous approaches have achieved promising results in learning code and query representations with BERT-based pre-trained models; these models, however, suffer from a semantic collapse problem, i.e., the native representations of code and queries cluster within a narrow, high-similarity interval. In this paper, we propose CrossCS, a cross-modal contrastive learning method for code search, which improves the representations of code and queries through explicit, fine-grained contrastive objectives. Specifically, we design a novel and effective contrastive objective that considers not only the similarity between modalities but also the similarity within modalities. To maintain the semantic consistency of code snippets under different function and variable names, we use data augmentation to rename functions and variables to meaningless tokens, which enables us to add intra-modal comparisons between code and augmented code. Moreover, to further improve the effectiveness of pre-trained models, we rank candidate code snippets using similarity scores weighted by retrieval scores and classification scores. Comprehensive experiments demonstrate that our method significantly improves the effectiveness of pre-trained models for code search.
Pages: 94-105 (12 pages)
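
The abstract describes three mechanisms: identifier-renaming augmentation, a combined contrastive objective, and score-weighted re-ranking. As a minimal illustration of the first, the sketch below renames functions and variables to meaningless tokens while preserving program semantics, assuming Python source and the standard `ast` module (Python 3.9+ for `ast.unparse`). The `RenameToMeaningless` transformer and the `var0, var1, ...` naming scheme are illustrative choices, not the paper's implementation; a production version would also have to leave builtins, imports, and attribute accesses untouched.

```python
import ast

class RenameToMeaningless(ast.NodeTransformer):
    """Rename every function and variable name to var0, var1, ...
    NOTE: illustrative only; a real augmenter must skip builtins
    and imported names to preserve semantics."""

    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"var{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._fresh(node.name)   # rename the function itself
        self.generic_visit(node)             # then its arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._fresh(node.id)
        return node

source = "def add(x, y):\n    total = x + y\n    return total"
tree = RenameToMeaningless().visit(ast.parse(source))
print(ast.unparse(tree))
# def var0(var1, var2):
#     var3 = var1 + var2
#     return var3
```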
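For the contrastive objective, the abstract states that similarity is compared both between modalities (query vs. code) and within modalities (code vs. its identifier-renamed variant). The PyTorch sketch below shows one plausible instantiation using an in-batch InfoNCE loss; the temperature and the weighting factor `alpha` are assumed hyperparameters, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.05):
    # In-batch InfoNCE: the i-th anchor should match the i-th positive;
    # every other positive in the batch acts as a negative.
    logits = anchor @ positive.t() / temperature              # (B, B)
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

def crosscs_style_loss(query_emb, code_emb, aug_code_emb, alpha=0.5):
    # `alpha` balances the two terms; it is an assumption, not a value
    # taken from the paper.
    query_emb = F.normalize(query_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    aug_code_emb = F.normalize(aug_code_emb, dim=-1)
    cross_modal = info_nce(query_emb, code_emb)      # between modalities
    intra_modal = info_nce(code_emb, aug_code_emb)   # within the code modality
    return cross_modal + alpha * intra_modal
```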
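Finally, the abstract says candidate code snippets are ranked by similarity scores weighted by retrieval and classification scores. The weighted-product combination below is only an assumed form consistent with that sentence; the actual weighting scheme is defined in the paper.

```python
def rerank(candidates, w_retrieval=0.5, w_classification=0.5):
    # `candidates` holds dicts with "similarity", "retrieval", and
    # "classification" scores. Both weights are hypothetical.
    def final_score(c):
        return c["similarity"] * (w_retrieval * c["retrieval"]
                                  + w_classification * c["classification"])
    return sorted(candidates, key=final_score, reverse=True)
```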