Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

Cited by: 190

Authors:

Geva, Mor [1,2]

Khashabi, Daniel [2]

Segal, Elad [1]

Khot, Tushar [2]

Roth, Dan [3]

Berant, Jonathan [1,2]

Affiliations:

[1] Tel Aviv Univ, Tel Aviv, Israel

[2] Allen Inst AI, Seattle, WA 98103 USA

[3] Univ Penn, Philadelphia, PA 19104 USA

Source:

TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS | 2021 / Vol. 9

Funding:

European Research Council;

Keywords:

Population statistics;

DOI:

10.1162/tacl_a_00370

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

A key limitation in current datasets for multihop reasoning is that the required steps for answering the question are mentioned in it explicitly. In this work, we introduce STRATEGYQA, a question answering (QA) benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy. A fundamental challenge in this setup is how to elicit such creative questions from crowdsourcing workers, while covering a broad range of potential strategies. We propose a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts. Moreover, we annotate each question with (1) a decomposition into reasoning steps for answering it, and (2) Wikipedia paragraphs that contain the answers to each step. Overall, STRATEGYQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs. Analysis shows that questions in STRATEGYQA are short, topic-diverse, and cover a wide range of strategies. Empirically, we show that humans perform well (87%) on this task, while our best baseline reaches an accuracy of approximately 66%.
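The abstract describes each example as a strategy question paired with a decomposition and evidence paragraphs. A minimal Python sketch of such a record follows. The class name, field names, yes/no answer type, and the sample decomposition and evidence text are all assumptions made for illustration; they are not taken from the released dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StrategyExample:
    """One example as outlined in the abstract; all field names are hypothetical."""
    question: str
    answer: bool  # assumption: yes/no answer; the abstract does not state the format
    decomposition: List[str] = field(default_factory=list)  # implicit reasoning steps
    evidence: List[List[str]] = field(default_factory=list)  # paragraphs per step

    def is_aligned(self) -> bool:
        # Each reasoning step should have its own (possibly empty) evidence list.
        return len(self.decomposition) == len(self.evidence)


# Illustrative instance built from the paper's title question.
# The steps and evidence strings below are placeholders, not dataset content.
example = StrategyExample(
    question="Did Aristotle use a laptop?",
    answer=False,
    decomposition=[
        "When did Aristotle live?",
        "When was the laptop invented?",
        "Is the second date before the first?",
    ],
    evidence=[
        ["<placeholder: paragraph about Aristotle>"],
        ["<placeholder: paragraph about laptops>"],
        [],  # a comparison step needs no retrieved paragraph
    ],
)

print(example.is_aligned())
```

Keeping the evidence list parallel to the decomposition makes the step-to-paragraph pairing in annotation (2) explicit, which is one plausible layout among several.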


Pages: 346-361

Page count: 16

