Clustering Examples in Multi-Dataset NLP Benchmarks with Item Response Theory

Cited by: 0
Authors
Rodriguez, Pedro
Htut, Phu Mon [1]
Lalor, John P. [2]
Sedoc, Joao [1]
Affiliations
[1] NYU, New York, NY 10003 USA
[2] Univ Notre Dame, Notre Dame, IN 46556 USA
Source
PROCEEDINGS OF THE THIRD WORKSHOP ON INSIGHTS FROM NEGATIVE RESULTS IN NLP (INSIGHTS 2022) | 2022
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
F [Economics];
Discipline classification code
02;
Abstract
In natural language processing, multi-dataset benchmarks for common tasks (e.g., SuperGLUE for natural language inference and MRQA for question answering) have risen in importance. Invariably, tasks and individual examples vary in difficulty. Recent analysis methods infer properties of examples such as difficulty. In particular, Item Response Theory (IRT) jointly infers example and model properties from the output of benchmark tasks (i.e., scores for each model-example pair). Therefore, it seems sensible that methods like IRT should be able to detect differences between datasets in a task. This work shows that current IRT models are not as good at identifying differences as we would expect, explains why this is difficult, and outlines future directions that incorporate more (textual) signal from examples.
Pages: 100-112
Page count: 13