This work addresses data leakage in the training of Named Entity Recognition (NER) models for Brazilian Portuguese legislative texts, caused by duplicates and inconsistent annotations that compromise model evaluation. After correcting this leakage in the UlyssesNER-Br corpus, we conducted a new benchmark, comparing the results with previous studies in a more reliable setting. We also re-evaluated a semi-supervised approach that combines self-learning and active sampling. However, because this approach reused a fixed confidence threshold, chosen from a pool of candidate values before the correction, its results were unsatisfactory. This indicates that a dynamic threshold, one that adapts to the characteristics of the post-correction data, could provide a more efficient and accurate evaluation, highlighting the need for future studies on threshold selection.
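To make the fixed-versus-dynamic threshold distinction concrete, the following is a minimal, self-contained sketch of threshold-based self-learning on synthetic data with scikit-learn. It is an illustration only, not the authors' pipeline: the classifier, the data, and the percentile-based dynamic rule are all assumptions standing in for one possible adaptive strategy.

```python
# Self-training with a fixed vs. a dynamic confidence threshold.
# Hypothetical sketch; the percentile rule is one possible way to
# adapt the cut-off to the post-correction confidence distribution.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, fixed_threshold=0.95,
               dynamic=False, percentile=90, max_rounds=5):
    model = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        model.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        probs = model.predict_proba(X_unlab)
        confs = probs.max(axis=1)   # model confidence per example
        preds = probs.argmax(axis=1)
        # Fixed rule: reuse a cut-off chosen once, in advance.
        # Dynamic rule: re-derive the cut-off each round from the
        # current confidence distribution (here, the top decile).
        cut = np.percentile(confs, percentile) if dynamic else fixed_threshold
        keep = confs >= cut
        if not keep.any():
            break
        # Promote high-confidence pseudo-labels into the training set.
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate([y_lab, preds[keep]])
        X_unlab = X_unlab[~keep]
    return model

X, y = make_classification(n_samples=500, random_state=0)
model = self_train(X[:50], y[:50], X[50:], dynamic=True)
```

The sketch shows why a fixed threshold can fail after a data correction: if the corrected data shifts the confidence distribution, a pre-chosen cut-off may select too few or too many pseudo-labels, whereas a distribution-aware rule adjusts automatically each round.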