XCODE: Towards Cross-Language Code Representation with Large-Scale Pre-Training

Cited by: 6
Authors
Lin, Zehao [1 ]
Li, Guodun [1 ]
Zhang, Jingfeng [1 ]
Deng, Yue [1 ]
Zeng, Xiangji [1 ]
Zhang, Yin [1 ]
Wan, Yao [2 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, 38 Zheda Rd, Hangzhou 310027, Zhejiang, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Comp Sci & Tech, Wuhan 430027, Hubei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deep learning; neural networks; code representation; cross-language; pre-training; LEARN;
DOI
10.1145/3506696
CLC Number
TP31 [Computer Software];
Subject Classification Code
081202; 0835;
Abstract
Source code representation learning is the basis of applying artificial intelligence to many software engineering tasks such as code clone detection, algorithm classification, and code summarization. Recently, many works have tried to improve source code representation from various perspectives, e.g., by introducing the structural information of programs into the latent representation. However, when dealing with the rapidly expanding unlabeled cross-language source code datasets from the Internet, two issues remain. First, deep learning models for many code-specific tasks still suffer from a lack of high-quality labels. Second, the structural differences among programming languages make it difficult to process multiple languages in a single neural architecture. To address these issues, in this article we propose XCoDE, a novel method for cross-language code representation with large-scale pre-training. Concretely, we use abstract syntax trees and ELMo-enhanced variational autoencoders to obtain multiple pre-trained source code language models, trained on about 1.5 million code snippets. To fully utilize the knowledge across programming languages, we further propose a Shared Encoder-Decoder (SED) architecture, which uses a multi-teacher, single-student method to transfer knowledge from the aforementioned pre-trained models to the distilled SED. The pre-trained models and the SED then cooperate to better represent source code. For evaluation, we examine our approach on three typical downstream cross-language tasks, i.e., source code translation, code clone detection, and code-to-code search, on a real-world dataset composed of programming exercises with multiple solutions. Experimental results demonstrate the effectiveness of our approach for cross-language code representation: it performs significantly better than several code representation baselines on different downstream tasks in terms of multiple automatic evaluation metrics.
Pages: 44
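To make the multi-teacher, single-student idea in the abstract concrete, below is a minimal PyTorch-style sketch of distilling several frozen, language-specific teacher encoders into one shared student encoder-decoder. All names, dimensions, and the GRU/MSE choices are illustrative assumptions, not details taken from the paper, which builds its teachers from AST-based, ELMo-enhanced variational autoencoders and trains the SED with its own objectives; this sketch only shows the general knowledge-transfer step.

```python
# Hypothetical sketch of multi-teacher -> single-student distillation.
# Names, dimensions, and loss choices are assumptions for exposition only.
import torch
import torch.nn as nn

EMB_DIM, HID_DIM, VOCAB = 128, 256, 5000

def make_teacher():
    # Stand-in for a frozen, language-specific pre-trained encoder.
    # (The paper uses AST-based, ELMo-enhanced VAE language models;
    #  a randomly initialized GRU is used here purely for illustration.)
    return nn.Sequential(
        nn.Embedding(VOCAB, EMB_DIM),
        nn.GRU(EMB_DIM, HID_DIM, batch_first=True),
    )

class SharedEncoderDecoder(nn.Module):
    """Student: a single shared encoder-decoder for all languages."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB_DIM)
        self.encoder = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
        self.decoder = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
        self.out = nn.Linear(HID_DIM, VOCAB)

    def encode(self, tokens):
        _, h = self.encoder(self.embed(tokens))
        return h  # (1, batch, HID_DIM) summary used for distillation

# One frozen teacher per programming language (hypothetical set).
teachers = {lang: make_teacher().eval() for lang in ("java", "python", "cpp")}
student = SharedEncoderDecoder()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
mse = nn.MSELoss()

def distill_step(batch_tokens, language):
    """Pull the student's shared encoding towards the encoding produced
    by the frozen teacher of the batch's source language."""
    with torch.no_grad():
        _, teacher_h = teachers[language](batch_tokens)
    loss = mse(student.encode(batch_tokens), teacher_h)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: a batch of 4 token sequences of length 16, labeled as Java code.
print(distill_step(torch.randint(0, VOCAB, (4, 16)), "java"))
```

In a full system the student's decoder would also be trained (e.g., on reconstruction or translation objectives) so that the shared representation supports the downstream cross-language tasks; the step above covers only the representation-transfer part.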