Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

Cited by: 190
Authors
Geva, Mor [1, 2]
Khashabi, Daniel [2 ]
Segal, Elad [1 ]
Khot, Tushar [2 ]
Roth, Dan [3 ]
Berant, Jonathan [1, 2]
Affiliations
[1] Tel Aviv Univ, Tel Aviv, Israel
[2] Allen Inst AI, Seattle, WA 98103 USA
[3] Univ Penn, Philadelphia, PA 19104 USA
Funding
European Research Council;
Keywords
Population statistics;
DOI
10.1162/tacl_a_00370
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
A key limitation in current datasets for multihop reasoning is that the required steps for answering the question are mentioned in it explicitly. In this work, we introduce STRATEGYQA, a question answering (QA) benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy. A fundamental challenge in this setup is how to elicit such creative questions from crowdsourcing workers, while covering a broad range of potential strategies. We propose a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts. Moreover, we annotate each question with (1) a decomposition into reasoning steps for answering it, and (2) Wikipedia paragraphs that contain the answers to each step. Overall, STRATEGYQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs. Analysis shows that questions in STRATEGYQA are short, topic-diverse, and cover a wide range of strategies. Empirically, we show that humans perform well (87%) on this task, while our best baseline reaches an accuracy of approximately 66%.
Pages: 346-361
Number of pages: 16