XCODE: Towards Cross-Language Code Representation with Large-Scale Pre-Training

Cited by: 6
Authors
Lin, Zehao [1 ]
Li, Guodun [1 ]
Zhang, Jingfeng [1 ]
Deng, Yue [1 ]
Zeng, Xiangji [1 ]
Zhang, Yin [1 ]
Wan, Yao [2 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, 38 Zheda Rd, Hangzhou 310027, Zhejiang, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Comp Sci & Tech, Wuhan 430027, Hubei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deep learning; neural networks; code representation; cross-language; pre-training; LEARN;
DOI
10.1145/3506696
CLC Number
TP31 [Computer Software];
Subject Classification Code
081202; 0835;
Abstract
Source code representation learning is the basis of applying artificial intelligence to many software engineering tasks such as code clone detection, algorithm classification, and code summarization. Recently, many works have tried to improve source code representation from various perspectives, e.g., by introducing the structural information of programs into the latent representation. However, when dealing with the rapidly expanding unlabeled cross-language source code datasets from the Internet, two issues remain. First, deep learning models for many code-specific tasks still suffer from a lack of high-quality labels. Second, the structural differences among programming languages make it difficult to process multiple languages in a single neural architecture. To address these issues, in this article we propose XCoDE, a novel method for cross-language code representation with large-scale pre-training. Concretely, we use abstract syntax trees and ELMo-enhanced variational autoencoders to obtain multiple pre-trained source code language models, trained on about 1.5 million code snippets. To fully utilize the knowledge across programming languages, we further propose a Shared Encoder-Decoder (SED) architecture, which uses a multi-teacher, single-student method to transfer knowledge from the aforementioned pre-trained models to the distilled SED. The pre-trained models and the SED then cooperate to better represent source code. For evaluation, we examine our approach on three typical downstream cross-language tasks, i.e., source code translation, code clone detection, and code-to-code search, on a real-world dataset composed of programming exercises with multiple solutions. Experimental results demonstrate the effectiveness of our approach for cross-language code representation: it performs significantly better than several code representation baselines on different downstream tasks in terms of multiple automatic evaluation metrics.
Pages: 44
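To make the multi-teacher, single-student idea in the abstract concrete, below is a minimal PyTorch-style sketch of distilling several frozen, language-specific teacher encoders into one shared student encoder-decoder. All names, dimensions, and the GRU/MSE choices are illustrative assumptions, not details taken from the paper, which builds its teachers from AST-based, ELMo-enhanced variational autoencoders and trains the SED with its own objectives; this sketch only shows the general knowledge-transfer step.

```python
# Hypothetical sketch of multi-teacher -> single-student distillation.
# Names, dimensions, and loss choices are assumptions for exposition only.
import torch
import torch.nn as nn

EMB_DIM, HID_DIM, VOCAB = 128, 256, 5000

def make_teacher():
    # Stand-in for a frozen, language-specific pre-trained encoder.
    # (The paper uses AST-based, ELMo-enhanced VAE language models;
    #  a randomly initialized GRU is used here purely for illustration.)
    return nn.Sequential(
        nn.Embedding(VOCAB, EMB_DIM),
        nn.GRU(EMB_DIM, HID_DIM, batch_first=True),
    )

class SharedEncoderDecoder(nn.Module):
    """Student: a single shared encoder-decoder for all languages."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB_DIM)
        self.encoder = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
        self.decoder = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
        self.out = nn.Linear(HID_DIM, VOCAB)

    def encode(self, tokens):
        _, h = self.encoder(self.embed(tokens))
        return h  # (1, batch, HID_DIM) summary used for distillation

# One frozen teacher per programming language (hypothetical set).
teachers = {lang: make_teacher().eval() for lang in ("java", "python", "cpp")}
student = SharedEncoderDecoder()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
mse = nn.MSELoss()

def distill_step(batch_tokens, language):
    """Pull the student's shared encoding towards the encoding produced
    by the frozen teacher of the batch's source language."""
    with torch.no_grad():
        _, teacher_h = teachers[language](batch_tokens)
    loss = mse(student.encode(batch_tokens), teacher_h)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: a batch of 4 token sequences of length 16, labeled as Java code.
print(distill_step(torch.randint(0, VOCAB, (4, 16)), "java"))
```

In a full system the student's decoder would also be trained (e.g., on reconstruction or translation objectives) so that the shared representation supports the downstream cross-language tasks; the step above covers only the representation-transfer part.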