Stance Detection Benchmark: How Robust is Your Stance Detection?

Cited by: 36
Authors
Schiller, Benjamin [1 ]
Daxenberger, Johannes [1 ]
Gurevych, Iryna [1 ]
Affiliations
[1] Tech Univ Darmstadt, Dept Comp Sci, Ubiquitous Knowledge Proc Lab, Darmstadt, Germany
Source
KUNSTLICHE INTELLIGENZ | 2021, Vol. 35, Issue 3-4
Keywords
Stance detection; Robustness; Multi-dataset learning; Agreement
DOI
10.1007/s13218-021-00714-w
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Stance detection (StD) aims to detect an author's stance towards a certain topic and has become a key component in applications like fake news detection, claim validation, and argument search. However, while humans detect stance easily, machine learning (ML) models clearly fall short on this task. Given the major differences in dataset sizes and in the framing of StD (e.g., number of classes and inputs), ML models trained on a single dataset usually generalize poorly to other domains. Hence, we introduce a StD benchmark that allows ML models to be compared against a wide variety of heterogeneous StD datasets to evaluate their generalizability and robustness. Moreover, the framework is designed for easy integration of new datasets and of probing methods for robustness. Among several baseline models, we define a model that learns from all ten StD datasets of various domains in a multi-dataset learning (MDL) setting and present new state-of-the-art results on five of the datasets. Yet, the models still perform well below human capability, and even simple perturbations of the original test samples (adversarial attacks) severely hurt the performance of MDL models. Deeper investigation suggests overfitting on dataset biases as the main reason for the decreased robustness. Our analysis emphasizes the need to focus on robustness and de-biasing strategies in multi-task learning approaches. To foster research on this important topic, we release the dataset splits, code, and fine-tuned weights.
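
The multi-dataset learning setup described in the abstract can be illustrated with a minimal sketch: a shared encoder with one classification head per stance dataset, trained on batches sampled from all datasets. This is not the authors' released code; the toy EmbeddingBag encoder, the dataset names, label counts, and random batches are illustrative assumptions standing in for the pre-trained transformer and the ten real StD datasets.

```python
# Minimal multi-dataset learning (MDL) sketch: shared parameters plus one
# classification head per stance dataset with its own label set.
import random
import torch
import torch.nn as nn

class MDLStanceModel(nn.Module):
    def __init__(self, vocab_size, hidden, labels_per_dataset):
        super().__init__()
        # Shared encoder: a toy mean-pooled embedding standing in for a
        # pre-trained transformer encoder.
        self.encoder = nn.EmbeddingBag(vocab_size, hidden)
        # One head per dataset, since label sets differ across datasets.
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden, n) for name, n in labels_per_dataset.items()
        })

    def forward(self, token_ids, dataset_name):
        return self.heads[dataset_name](self.encoder(token_ids))

# Hypothetical label counts for three of the ten stance datasets.
labels_per_dataset = {"argmin": 2, "fnc1": 4, "semeval2016t6": 3}
model = MDLStanceModel(vocab_size=30000, hidden=128,
                       labels_per_dataset=labels_per_dataset)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

def random_batch(name, batch_size=8, seq_len=16):
    # Placeholder for real (topic, text, label) batches from each dataset.
    x = torch.randint(0, 30000, (batch_size, seq_len))
    y = torch.randint(0, labels_per_dataset[name], (batch_size,))
    return x, y

for step in range(100):
    name = random.choice(list(labels_per_dataset))  # sample a dataset per step
    x, y = random_batch(name)
    loss = loss_fn(model(x, name), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Robustness probing in the same spirit as the paper's adversarial attacks would then apply simple perturbations (e.g., paraphrasing, negation, or typos) to the held-out test samples of each dataset and compare per-head performance before and after; the specific perturbation methods above are not reproduced here.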
Pages: 329-341
Page count: 13