CachacaNER: a dataset for named entity recognition in texts about the cachaca beverage

Cited by: 2
Authors
Silva, Priscilla [1]
Franco, Arthur [1]
Santos, Thiago [1]
Brito, Mozar [2]
Pereira, Denilson [1]
Affiliations
[1] Univ Fed Lavras, Dept Comp Sci, POB 3037, BR-37200900 Lavras, MG, Brazil
[2] Univ Fed Lavras, Dept Agroind Management, POB 3037, BR-37200900 Lavras, MG, Brazil
Keywords
NER; Named entity recognition; Dataset; Labeled data; Cachaca
DOI
10.1007/s10579-023-09665-0
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Named Entity Recognition (NER) is the task of identifying and classifying tokens in a text that correspond to a set of predefined categories, such as names of people, organizations, and locations. Datasets labeled for this task are essential for training supervised machine learning models. Although many labeled datasets exist for English, they are scarcer for Portuguese. This work contributes a manually labeled dataset for the NER task, together with its evaluation, built from texts in Brazilian Portuguese in the specific domain of the beverage called Cachaca, a popular drink in Brazil of great economic importance. This is the first NER dataset in the beverage domain, and it can be useful for other types of beverages with similar entity categories, such as wine and beer. We describe the process of data collection, the creation of the dataset, and its experimental evaluation. The resulting dataset contains over 180,000 tokens labeled in 17 entity categories. The labeling obtained an agreement coefficient of 0.857 among the labelers according to the Fleiss' Kappa metric, which is considered almost perfect agreement. In our experimental evaluation, we obtained a micro-F1 value of 0.933 on the test set. The size of the dataset and the result of its experimental evaluation are comparable to those of other datasets in the Portuguese language, even though ours has a greater number of entity categories.
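For readers unfamiliar with the agreement coefficient mentioned in the abstract, the following is a minimal sketch of how Fleiss' Kappa can be computed from per-token labels assigned by several annotators. It follows the standard definition of the statistic and is not taken from the paper; the function name `fleiss_kappa`, the input layout, and the toy labels in the usage note are assumptions made for illustration only.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Compute Fleiss' kappa.

    ratings: list of items (e.g. tokens); each item is a list with one
    label per annotator. Every item must have the same number of labels.
    This is an illustrative sketch, not the authors' implementation.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of labels")

    category_totals = Counter()  # how often each category was assigned overall
    agreement_sum = 0.0          # sum of per-item agreement P_i

    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # P_i: proportion of annotator pairs that agree on this item
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    p_bar = agreement_sum / n_items  # mean observed agreement
    total = n_items * n_raters
    # P_e: agreement expected by chance from the category proportions
    p_e = sum((c / total) ** 2 for c in category_totals.values())

    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)
```

As a usage example with hypothetical labels, `fleiss_kappa([["A", "A", "B"], ["B", "B", "B"], ["A", "A", "A"], ["O", "O", "O"]])` evaluates to 70/94, roughly 0.745: three annotators agree fully on three of the four tokens and split two to one on the remaining token.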
Pages: 1315-1333
Number of pages: 19