Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors

Cited by: 0
Authors
Tang, Liyan [1 ]
Goyal, Tanya [1 ]
Fabbri, Alexander R. [2 ]
Laban, Philippe [2 ]
Xu, Jiacheng [1 ,2 ]
Yavuz, Semih [2 ]
Kryscinski, Wojciech [2 ]
Rousseau, Justin F. [1 ]
Durrett, Greg [1 ]
Affiliations
[1] Univ Texas Austin, Austin, TX 78712 USA
[2] Salesforce AI Res, San Francisco, CA USA
Source
PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1 | 2023
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The propensity of abstractive summarization models to make factual errors has been studied extensively, including design of metrics to detect factual errors and annotation of errors in current systems' outputs. However, the ever-evolving nature of summarization systems, metrics, and annotated benchmarks makes factuality evaluation a moving target, and drawing clear comparisons among metrics has become increasingly difficult. In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model. We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models. Critically, our analysis shows that much of the recent improvement in the factuality detection space has been on summaries from older (pre-Transformer) models rather than from more relevant recent summarization models. We further perform a finer-grained analysis per error type and find similar performance variance across error types for different factuality metrics. Our results show that no single metric is superior in all settings or for all error types, and we provide recommendations for best practices given these insights.
Pages: 11626-11644
Page count: 19