CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model

Cited by: 0
Authors
Jung, Tae-Hwan [1]
Affiliations
[1] Kyung Hee Univ, Seoul, South Korea
Source
NLP4PROG 2021: THE 1ST WORKSHOP ON NATURAL LANGUAGE PROCESSING FOR PROGRAMMING (NLP4PROG 2021) | 2021
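Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract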
In version control using Git, the commit message is a document that summarizes source code changes in natural language. A good commit message clearly shows the source code changes, which enhances collaboration between developers. To write a good commit message, the message should briefly summarize the source code changes, which takes a lot of time and effort. Therefore, much research has been conducted on automatically generating a commit message when a code modification is given. However, in most of the studies so far, there was no curated dataset of code modifications (additions and deletions) and corresponding commit messages in various programming languages. The models also had difficulty learning the contextual representation between code modification and natural language. To solve these problems, we propose the following two methods: (1) We collect code modifications and corresponding commit messages from GitHub for six languages (Python, PHP, Go, Java, JavaScript, and Ruby) and release a well-organized dataset of 345K pairs. (2) In order to resolve the large gap in contextual representation between programming language (PL) and natural language (NL), we use CodeBERT (Feng et al., 2020), a pre-trained language model (PLM) for programming code, as an initial model. Using these two methods leads to successful results in the commit message generation task. Also, this is the first research attempt at fine-tuning for commit message generation using various programming languages and a code PLM.
Pages: 26-33
Page count: 8
Related papers
22 in total
[1]  
Clark Kevin, 2020, ELECTRA: Pretraining text encoders as discriminators rather than generators, DOI 10.48550/arXiv.2003.10555
[2]  
Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
[3]  
Feng Zhangyin, 2020, arXiv abs/2002.08155, V2020, P1536, DOI 10.18653/v1/2020.findings-emnlp.139
[4]  
Husain Hamel, 2019, CoRR abs/1909.09436
[5]  
Jiang SY, 2017, IEEE INT CONF AUTOM, P135, DOI 10.1109/ASE.2017.8115626
[6]   Towards Automatic Generation of Short Summaries of Commits [J].
Jiang, Siyuan ;
McMillan, Collin .
2017 IEEE/ACM 25TH INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION (ICPC), 2017, :320-323
[7]  
Liu S., 2020, IEEE Transactions on Software Engineering
[8]  
Liu Yinhan, 2019, CoRR, DOI 10.48550/arXiv.1907.11692
[9]   Neural-Machine-Translation-Based Commit Message Generation: How Far Are We? [J].
Liu, Zhongxin ;
Xia, Xin ;
Hassan, Ahmed E. ;
Lo, David ;
Xing, Zhenchang ;
Wang, Xinyu .
PROCEEDINGS OF THE 2018 33RD IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE '18), 2018, :373-384
[10]   A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes [J].
Loyola, Pablo ;
Marrese-Taylor, Edison ;
Matsuo, Yutaka .
PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 2, 2017, :287-292