Analysis of systems' performance in natural language processing competitions

Cited by: 0
Authors
Nava-Munoz, Sergio [1, 2]
Graff, Mario [2, 3]
Escalante, Hugo Jair [4]
Affiliations
[1] CIMA Ctr Invest Matemat A C, Calzada Plenitud 103, Jose Vasconcelos 20200, Aguascalientes, Mexico
[2] INFOTEC Ctr Invest Innovac Tecnol Informac & Comun, Circuito Tecnopolo 112,Fracc Tecnopolo Pocitos 2, Aguascalientes 20313, Aguascalientes, Mexico
[3] Consejo Nacl Human Ciencia & Tecnol CONAHCYT, Insurgentes 1582, Benito Juarez 03940, Ciudad de Mexic, Mexico
[4] INAOE Inst Nacl Astrofis Opt & Elect, Luis Enr Erro 1, Tonantzintla 72840, Puebla, Mexico
Keywords
Performance; Bootstrap; Challenges; STATISTICAL COMPARISONS; RECOMMENDATION SYSTEM; REST-MEX; CLASSIFIERS; TEXT
DOI
10.1016/j.patrec.2024.03.010
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Collaborative competitions have gained popularity in the scientific and technological fields. These competitions involve defining tasks, selecting evaluation scores, and devising result verification methods. In the standard scenario, participants receive a training set and are expected to provide a solution for a held-out dataset kept by organizers. An essential challenge for organizers arises when comparing algorithms' performance, assessing multiple participants, and ranking them. Statistical tools are often used for this purpose; however, traditional statistical methods often fail to capture decisive differences between systems' performance. This manuscript describes an evaluation methodology for statistically analyzing competition results and the competitions themselves. The methodology is designed to be universally applicable; however, it is illustrated using eight natural language competitions as case studies involving classification and regression problems. The proposed methodology offers several advantages, including off-the-shelf comparisons with correction mechanisms and the inclusion of confidence intervals. Furthermore, we introduce metrics that allow organizers to assess the difficulty of competitions. Our analysis shows the potential usefulness of our methodology for effectively evaluating competition results.
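The abstract mentions bootstrap-based comparisons, confidence intervals, and correction mechanisms without giving details. The following is a minimal sketch of what such an analysis could look like, not the authors' implementation: it assumes a percentile bootstrap over the held-out set and the Benjamini-Hochberg procedure as the correction. All function names (`accuracy`, `bootstrap_scores`, `percentile_interval`, `benjamini_hochberg`) and the toy data are hypothetical.

```python
import random


def accuracy(gold, pred):
    """Fraction of items where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def bootstrap_scores(gold, pred, score=accuracy, n_boot=1000, seed=0):
    """Resample the test set with replacement and rescore each resample."""
    rng = random.Random(seed)
    n = len(gold)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(score([gold[i] for i in idx], [pred[i] for i in idx]))
    return scores


def percentile_interval(samples, alpha=0.05):
    """Percentile confidence interval from a list of bootstrap scores."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    return ordered[int(alpha / 2 * last)], ordered[int((1 - alpha / 2) * last)]


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= k_max
    return rejected


# Toy example (assumed data): 100 gold labels and a system with 10 errors.
gold = [0, 1] * 50
pred = [1 - g if i % 10 == 0 else g for i, g in enumerate(gold)]
low, high = percentile_interval(bootstrap_scores(gold, pred))
print(f"accuracy={accuracy(gold, pred):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

In a competition setting, one interval of this kind could be computed per participant, and the p-values from pairwise comparisons against the top system could be passed to `benjamini_hochberg` before declaring any difference significant.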
Pages: 346-353
Number of pages: 8
Related papers
24 in total
[1]   VaxxStance@IberLEF 2021: Overview of the Task on Going Beyond Text in Cross-Lingual Stance Detection [J].
Agerri, Rodrigo ;
Centeno, Roberto ;
Espinosa, Maria ;
Fernandez de Landa, Joseba ;
Rodrigo, Alvaro .
PROCESAMIENTO DEL LENGUAJE NATURAL, 2021, (67) :173-181
[2]   Overview of Rest-Mex at IberLEF 2022: Recommendation System, Sentiment Analysis and Covid Semaphore Prediction for Mexican Tourist Texts [J].
Alvarez-Carmona, Miguel A. ;
Diaz-Pacheco, Angel ;
Aranda, Ramon ;
Rodriguez-Gonzalez, Ansel Y. ;
Fajardo-Delgado, Daniel ;
Guerrero-Rodriguez, Rafael ;
Bustio-Martinez, Lazaro .
PROCESAMIENTO DEL LENGUAJE NATURAL, 2022, (69) :289-299
[3]   Overview of Rest-Mex at IberLEF 2021: Recommendation System for Text Mexican Tourism [J].
Alvarez-Carmona, Miguel A. ;
Aranda, Ramon ;
Arce-Cardenas, Samuel ;
Fajardo-Delgado, Daniel ;
Guerrero-Rodriguez, Rafael ;
Pastor Lopez-Monroy, A. ;
Martinez-Miranda, Juan ;
Perez-Espinosa, Humberto ;
Rodriguez-Gonzalez, Ansel Y. .
PROCESAMIENTO DEL LENGUAJE NATURAL, 2021, (67) :163-172
[4]   Aragon M. E., 2019, P IBERIAN LANGUAGES
[5]   Overview of PAR-MEX at Iberlef 2022: Paraphrase Detection in Spanish Shared Task [J].
Bel-Enguix, Gemma ;
Sierra, Gerardo ;
Gomez-Adorno, Helena ;
Torres-Moreno, Juan-Manuel ;
Ortiz-Barajas, Jesus-German ;
Vasquez, Juan .
PROCESAMIENTO DEL LENGUAJE NATURAL, 2022, (69) :255-263
[6]   CONTROLLING THE FALSE DISCOVERY RATE - A PRACTICAL AND POWERFUL APPROACH TO MULTIPLE TESTING [J].
BENJAMINI, Y ;
HOCHBERG, Y .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 1995, 57 (01) :289-300
[7]   Berg-Kirkpatrick Taylor, 2012, EMNLP CONLL 2012 201
[8]   Bisani M, 2004, 2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS, P409
[9]   Chernick M.R., LaBudde R.A., 2011, An Introduction to Bootstrap Methods with Applications to R, 1st ed., DOI 10.1016/J.IFACOL.2018.08.474
[10]   Demsar J, 2006, J MACH LEARN RES, V7, P1