Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries

Cited: 0
Authors
Tang, Xiangru [1 ]
Fabbri, Alexander R. [1 ]
Mao, Ziming [1 ]
Adams, Griffin [2 ]
Wang, Borui [1 ]
Li, Haoran [3 ]
Mehdad, Yashar [3 ]
Radev, Dragomir [1 ]
Affiliations
[1] Yale Univ, New Haven, CT 06520 USA
[2] Columbia Univ, New York, NY 10027 USA
[3] Facebook AI, New York, NY USA
Source
NAACL 2022: The 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies | 2022
Keywords: (none listed)
DOI: not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Current pre-trained models applied to summarization are prone to factual inconsistencies that misrepresent the source text. Evaluating the factual consistency of summaries is thus necessary to develop better models. However, the human evaluation setup for assessing factual consistency has not been standardized. To determine the factors that affect the reliability of human evaluation, we crowdsource evaluations of factual consistency across state-of-the-art models on two news summarization datasets using the rating-based Likert Scale and ranking-based Best-Worst Scaling. Our analysis reveals that ranking-based Best-Worst Scaling offers a more reliable measure of summary quality across datasets, and that the reliability of Likert ratings depends heavily on the target dataset and the evaluation design. To improve crowdsourcing reliability, we extend the scale of the Likert rating and present a scoring algorithm for Best-Worst Scaling that we call value learning. Our crowdsourcing guidelines will be publicly available to facilitate future work on factual consistency in summarization.
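The paper's own "value learning" scoring algorithm is not reproduced in this record. For orientation, the standard counting-based Best-Worst Scaling score that value learning would refine can be sketched as follows: each item's score is the number of times it was chosen as best, minus the number of times it was chosen as worst, divided by the number of tuples in which it appeared. The data layout below (tuples of item labels with one best and one worst pick per annotation) is an illustrative assumption, not the paper's format.

```python
from collections import defaultdict

def bws_scores(annotations):
    """Standard Best-Worst Scaling counting scores.

    annotations: iterable of (items, best, worst), where `items` is the
    tuple of summaries shown to an annotator and `best`/`worst` are the
    summaries picked as most/least factually consistent.
    Returns {item: (best_count - worst_count) / appearance_count},
    ranging from -1 (always worst) to +1 (always best).
    """
    best_count = defaultdict(int)
    worst_count = defaultdict(int)
    appearances = defaultdict(int)
    for items, best, worst in annotations:
        for item in items:
            appearances[item] += 1
        best_count[best] += 1
        worst_count[worst] += 1
    return {
        item: (best_count[item] - worst_count[item]) / appearances[item]
        for item in appearances
    }
```

For example, three annotations over the 4-tuple ("A", "B", "C", "D"), with "A" picked best twice and "D" picked worst twice, yield scores of 2/3 for "A" and -2/3 for "D". Aggregating over many overlapping tuples gives each summary a stable ranking position, which is the property the paper finds more reliable than raw Likert ratings.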
Pages: 5680-5692 (13 pages)