Reassessing Automatic Evaluation Metrics for Code Summarization Tasks

Cited by: 58
Authors
Roy, Devjeet [1 ]
Fakhoury, Sarah [1 ]
Arnaoudova, Venera [1 ]
Affiliations
[1] Washington State University, Pullman, WA 99164, USA
Source
Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '21) | 2021
Keywords
automatic evaluation metrics; code summarization; machine translation
DOI
10.1145/3468264.3468588
CLC classification
TP31 [Computer Software]
Subject classification codes
081202; 0835
Abstract
In recent years, research in the domain of source code summarization has adopted data-driven techniques pioneered in machine translation (MT). Automatic evaluation metrics such as BLEU, METEOR, and ROUGE are fundamental to the evaluation of MT systems and have been adopted as proxies for human evaluation in the code summarization domain. However, the extent to which automatic metrics agree with the gold standard of human evaluation has not been assessed on code summarization tasks. Despite this, marginal improvements in metric scores are often used to discriminate between the performance of competing summarization models. In this paper, we present a critical exploration of the applicability and interpretation of automatic metrics as evaluation techniques for code summarization tasks. We conduct an empirical study with 226 human annotators to assess the degree to which automatic metrics reflect human evaluation. Results indicate that metric improvements of less than 2 points do not guarantee systematic improvements in summarization quality and are unreliable as proxies of human evaluation. When the difference between metric scores for two summarization approaches increases but remains within 5 points, some metrics, such as METEOR and chrF, become highly reliable proxies, whereas others, such as corpus BLEU, remain unreliable. Based on these findings, we make several recommendations for the use of automatic metrics to discriminate model performance in code summarization.
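To make the metric comparison concrete, the following minimal sketch (not part of the paper or its replication package) shows how corpus-level BLEU and chrF are typically computed for two competing summarization models, using the sacrebleu Python library; the reference and model summaries here are hypothetical.

import sacrebleu

# One human-written reference summary per code snippet (hypothetical examples).
references = [
    "returns the maximum value in the given list",
    "opens the file and reads its contents into a string",
]

# Outputs from two hypothetical summarization models being compared.
model_a = [
    "return the maximum value of a list",
    "opens a file and reads the contents to a string",
]
model_b = [
    "returns the max element in the given list",
    "open file and read contents into string",
]

for name, outputs in (("model A", model_a), ("model B", model_b)):
    # corpus_bleu/corpus_chrf take a list of hypotheses and a list of
    # reference streams (one stream here); scores are on a 0-100 scale.
    bleu = sacrebleu.corpus_bleu(outputs, [references])
    chrf = sacrebleu.corpus_chrf(outputs, [references])
    print(f"{name}: BLEU = {bleu.score:.2f}, chrF = {chrf.score:.2f}")

Read against the paper's findings, a gap of less than roughly 2 points between the two printed scores should not be taken as evidence of a systematic quality difference, and for corpus BLEU even gaps within 5 points remain unreliable.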
Pages: 1105-1116
Page count: 12