UlyssesNER-Br: A Corpus of Brazilian Legislative Documents for Named Entity Recognition

Cited by: 12

Authors:

Albuquerque, Hidelberg O. [1,2]

Costa, Rosimeire [3]

Silvestre, Gabriel [4]

Souza, Ellen [1,4]

da Silva, Nadia F. F. [3,4]

Vitorio, Douglas [1,2]

Moriyama, Gyovana [4]

Martins, Lucas [4]

Soezima, Luiza [4]

Nunes, Augusto [4]

Siqueira, Felipe [4]

Tarrega, Joao P. [4]

Beinotti, Joao V. [4]

Dias, Marcio [3]

Silva, Matheus [3]

Gardini, Miguel [4]

Silva, Vinicius [4]

de Carvalho, Andre C. P. L. F. [4]

Oliveira, Adriano L. I. [2]

Affiliations:

[1] Univ Fed Rural Pernambuco, MiningBR Res Grp, Recife, PE, Brazil

[2] Univ Fed Pernambuco, Ctr Informat, Recife, PE, Brazil

[3] Univ Fed Goias, Inst Informat, Goiania, GO, Brazil

[4] Univ Sao Paulo, Inst Math & Comp Sci, Sao Paulo, Brazil

Source:

COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE, PROPOR 2022 | 2022 / Vol. 13208

Funding:

São Paulo Research Foundation (FAPESP), Brazil

Keywords:

Annotation schema; Named Entity Recognition; Legal information retrieval

DOI:

10.1007/978-3-030-98305-5_1

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The number of legislative documents produced within the past decade has risen dramatically, making it difficult for law practitioners to consult and update legislation. Named Entity Recognition (NER) systems have untapped potential to extract information from legal documents, which can improve information retrieval and decision-making processes. We introduce UlyssesNER-Br, a corpus of Brazilian legislative documents for NER with quality baselines. The corpus consists of bills and legislative consultations from the Brazilian Chamber of Deputies. We implemented Conditional Random Field (CRF) and Hidden Markov Model (HMM) models; the CRF model achieved promising F1-scores of 80.8% in the analysis by categories and 81.04% in the analysis by types. The entities with the best average F1-score results were "FUNDlei" and "DATA", and the ones with the worst results were "EVENTO" and "PESSOAgrupoind". The corpus was also evaluated using the BiLSTM-CRF and GloVe architecture provided by the pioneering state-of-the-art paper, achieving an F1-score of 76.89% in the analysis by categories and 59.67% in the analysis by types.
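The distinction between the analysis by categories and the analysis by types can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes BIO-tagged sequences, exact-match entity-level scoring, and that a coarse category can be approximated by the leading uppercase run of a type label (so "PESSOAgrupoind" becomes "PESSOA"). The label "PESSOAcargo" and the function names are hypothetical and used only for illustration.

```python
import re
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (assumed scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith(("B-", "I-")):
            # A stray I- tag is leniently treated as the start of an entity.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def to_category(tag: str) -> str:
    """Assumption: the category is the leading uppercase run of the type."""
    if tag == "O":
        return tag
    prefix, label = tag.split("-", 1)
    match = re.match(r"[A-Z]+", label)
    return f"{prefix}-{match.group(0) if match else label}"


def f1_score(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching (sentence, span, label)."""
    g, p = set(), set()
    for idx, (gold_tags, pred_tags) in enumerate(zip(gold, pred)):
        g |= {(idx, *span) for span in bio_to_spans(gold_tags)}
        p |= {(idx, *span) for span in bio_to_spans(pred_tags)}
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


# Toy data: the prediction gets the span right but the fine-grained type wrong.
gold = [["B-PESSOAgrupoind", "I-PESSOAgrupoind", "O", "B-DATA"]]
pred = [["B-PESSOAcargo", "I-PESSOAcargo", "O", "B-DATA"]]

f1_types = f1_score(gold, pred)
f1_categories = f1_score(
    [[to_category(t) for t in s] for s in gold],
    [[to_category(t) for t in s] for s in pred],
)
print(f1_types, f1_categories)
```

In this toy case the type-level score is lower than the category-level score, because a prediction with the right span but the wrong fine-grained type only counts as correct once labels are collapsed to categories. Real corpora can behave differently, as the reported CRF results show.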

Pages: 3-14

Page count: 12
