Towards Summarizing Code Snippets Using Pre-Trained Transformers

Cited by: 0
Authors
Mastropaolo, Antonio [1]
Ciniselli, Matteo [1]
Pascarella, Luca [2]
Tufano, Rosalia [1]
Aghajani, Emad [1]
Bavota, Gabriele [1]
Affiliations
[1] Univ Svizzera Italiana, SEART Software Inst, Lugano, Switzerland
[2] Swiss Fed Inst Technol, Ctr Project Based Learning, Zurich, Switzerland
Source
PROCEEDINGS 2024 32ND IEEE/ACM INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION, ICPC 2024 | 2024
Funding
European Research Council;
Keywords
Software Documentation; Pre-trained Transformer Models;
DOI
10.1145/3643916.3644400
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
When comprehending code, a helping hand may come from the natural language comments documenting it, which, unfortunately, are not always there. To support developers in such a scenario, several techniques have been presented to automatically generate natural language summaries for a given piece of code. Most recent approaches exploit deep learning (DL) to automatically document classes or functions, while little effort has been devoted to more fine-grained documentation (e.g., documenting code snippets or even a single statement). This design choice is dictated by the availability of training data: for example, in the case of Java, it is easy to create datasets composed of pairs <method, javadoc> that can be fed to DL models to teach them how to summarize a method. Such comment-to-code linking is instead non-trivial when it comes to inner comments documenting a few statements. In this work, we take all the steps needed to train a DL model to automatically document code snippets. First, we manually built a dataset featuring 6.6k comments that have been (i) classified based on their type (e.g., code summary, TODO), and (ii) linked to the code statements they document. Second, we used this dataset to train a multi-task DL model that takes a comment as input and is able to (i) classify whether it represents a "code summary" or not, and (ii) link it to the code statements it documents. Our model identifies code summaries with 84% accuracy and is able to link them to the documented lines of code with recall and precision higher than 80%. Third, we ran this model on 10k projects, identifying code summaries and linking them to the documented code. This unlocked the possibility of building a large-scale dataset of documented code snippets, which was then used to train a new DL model able to automatically document code snippets. A comparison with state-of-the-art baselines shows the superiority of the proposed approach, which, however, is still far from representing an accurate solution for snippet summarization.
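The abstract reports recall and precision for linking a comment to the lines of code it documents, but does not say how these are computed. The following is a minimal sketch of one plausible reading, assuming each comment is associated with a set of predicted line numbers and a set of manually annotated ("gold") line numbers. The function name `linking_scores` and the example line numbers are hypothetical and do not come from the paper.

```python
def linking_scores(predicted: set[int], gold: set[int]) -> tuple[float, float]:
    """Return (precision, recall) for the lines linked to a single comment.

    Assumption: linking is scored as set overlap between predicted and
    annotated line numbers; the paper may define its metrics differently.
    """
    true_positives = len(predicted & gold)
    # Precision: share of predicted lines that are actually documented.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of documented lines that the model recovered.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: the model links a comment to lines 12-15,
# while the annotators linked it to lines 12-14, 16, and 17.
predicted_lines = {12, 13, 14, 15}
gold_lines = {12, 13, 14, 16, 17}
precision, recall = linking_scores(predicted_lines, gold_lines)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example three of the four predicted lines are correct and three of the five annotated lines are recovered. Averaging such per-comment scores over a test set would be one way to obtain aggregate figures, though whether the authors aggregate this way is not stated in the abstract.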
Pages: 1 / 12
Page count: 12