A Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers

Cited by: 24
Authors
Hamdi, Ahmed [1]
Pontes, Elvys Linhares [1]
Boros, Emanuela [1]
Thi Tuyet Hai Nguyen [1]
Hackl, Guenter [2]
Moreno, Jose G. [3]
Doucet, Antoine [1]
Affiliations
[1] Univ La Rochelle, L3i, La Rochelle, France
[2] Innsbruck Univ Innovat GmbH, Innsbruck, Austria
[3] Univ Toulouse, IRIT, Toulouse, France
Source
SIGIR '21 - Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval | 2021
Keywords
datasets; multilingual; diachronic historical newspapers; named entity recognition; entity linking; stance detection;
DOI
10.1145/3404835.3463255
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Named entity processing over historical texts is increasingly used, owing to the massive volume of documents and archives stored in digital libraries. However, because annotated historical resources are scarce, information extraction performance lags behind that achieved on contemporary texts. In this paper, we introduce the development of the NewsEye resource, a multilingual dataset for named entity recognition and linking, enriched with stances towards named entities. The dataset comprises diachronic historical newspaper material published between 1850 and 1950 in French, German, Finnish, and Swedish. Such a historical resource is essential for developing and evaluating named entity processing systems. It also makes it possible to improve the performance of existing approaches on historical documents, which enables adequate and efficient semantic indexing of historical documents in digital cultural heritage collections.
Pages: 2328-2334
Page count: 7