Named Entity Recognition and Data Leakage in Legislative Texts: A Literature Reassessment

Authors
Nunes, Rafael Oleques [1]
Spritzer, Andre Susliz [1]
Freitas, Carla Maria Dal Sasso [1]
Balreira, Dennis Giovani [1]
Affiliations
[1] Univ Fed Rio Grande do Sul, Porto Alegre, Brazil
Source
LINGUAMATICA | 2024 / Vol. 16 / No. 2
Keywords
data leakage; named entity recognition; legislative texts; benchmark; self-learning; Portuguese;
DOI
Not available
Chinese Library Classification
H0 [Linguistics];
Subject classification codes
030303; 0501; 050102;
Abstract
This work addresses data leakage in training Named Entity Recognition (NER) models in Brazilian Portuguese legislative texts, resulting from duplicates and inconsistent annotations, which compromise model evaluation. After correcting this leakage in the UlyssesNER-Br corpus, we conducted a new benchmark, comparing the results with previous studies in a more reliable setting. We also re-evaluated a semi-supervised approach using self-learning and active sampling. However, by reusing a fixed threshold, chosen from a cloud of values before the correction, the results were unsatisfactory. This indicates that a dynamic threshold, which adapts to the characteristics of the data post-correction, could provide a more efficient and accurate evaluation, highlighting the need for future studies on threshold selection.
Pages: 26