Clustering Examples in Multi-Dataset NLP Benchmarks with Item Response Theory

Citations: 0

Authors:

Rodriguez, Pedro

Htut, Phu Mon [1]

Lalor, John P. [2]

Sedoc, Joao [1]

Affiliations:

[1] NYU, New York, NY 10003 USA

[2] Univ Notre Dame, Notre Dame, IN 46556 USA

Source:

PROCEEDINGS OF THE THIRD WORKSHOP ON INSIGHTS FROM NEGATIVE RESULTS IN NLP (INSIGHTS 2022) | 2022

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

F [Economics];

Discipline classification code:

02;

Abstract:

In natural language processing, multi-dataset benchmarks for common tasks (e.g., SuperGLUE for natural language inference and MRQA for question answering) have risen in importance. Invariably, tasks and individual examples vary in difficulty. Recent analysis methods infer properties of examples such as difficulty. In particular, Item Response Theory (IRT) jointly infers example and model properties from the output of benchmark tasks (i.e., scores for each model-example pair). Therefore, it seems sensible that methods like IRT should be able to detect differences between datasets in a task. This work shows that current IRT models are not as good at identifying differences as we would expect, explains why this is difficult, and outlines future directions that incorporate more (textual) signal from examples.
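To make the phrase "jointly infers example and model properties from scores for each model-example pair" concrete, the following is a minimal sketch of a one-parameter logistic (1PL, or Rasch) IRT model fitted by gradient ascent. It is an illustration only: this record does not say which IRT variant the authors use, and the function name `fit_1pl`, the L2 penalty, the optimizer settings, and the toy response matrix are all assumptions made for this example.

```python
import numpy as np


def fit_1pl(responses, steps=500, lr=0.1, l2=0.01):
    """Fit a 1PL (Rasch) IRT model by gradient ascent (illustrative sketch).

    responses[j, i] is 1 if model j answered example i correctly, else 0.
    The model assumes P(correct) = sigmoid(ability[j] - difficulty[i]).
    Returns (ability per model, difficulty per example).
    """
    responses = np.asarray(responses, dtype=float)
    n_models, n_items = responses.shape
    ability = np.zeros(n_models)
    difficulty = np.zeros(n_items)
    for _ in range(steps):
        logits = ability[:, None] - difficulty[None, :]
        prob = 1.0 / (1.0 + np.exp(-logits))
        residual = responses - prob  # observed minus predicted
        # Gradient of the log-likelihood, with a small L2 penalty (an
        # assumption of this sketch) that keeps estimates finite for
        # examples every model gets right or wrong.
        ability += lr * (residual.sum(axis=1) - l2 * ability)
        difficulty += lr * (-residual.sum(axis=0) - l2 * difficulty)
    return ability, difficulty


# Hypothetical toy data: 3 models (rows) scored on 4 examples (columns).
toy_responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
ability, difficulty = fit_1pl(toy_responses)
print("ability per model:     ", np.round(ability, 2))
print("difficulty per example:", np.round(difficulty, 2))
```

In this toy setting, models that answer more examples correctly receive higher ability estimates, and examples that fewer models answer correctly receive higher difficulty estimates. Whether such per-example parameters also separate the datasets within a benchmark is the question the abstract raises.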


Pages: 100-112

Page count: 13
