AGORA: An intelligent system for the anonymization, information extraction and automatic mapping of sensitive documents

Cited by: 1
Authors
Juez-Hernandez, Rodrigo [1]
Quijano-Sanchez, Lara [1, 2, 5]
Liberatore, Federico [2, 3]
Gomez, Jesus [4]
Affiliations
[1] Univ Autonoma Madrid, Escuela Politecn Super, Madrid, Spain
[2] Univ Carlos III Madrid, Santander Big Data Inst UC3M, Madrid, Spain
[3] Cardiff Univ, Sch Comp Sci & Informat, Cardiff, Wales
[4] Minist Interior, Oficina Nacl Lucha Delitos Odio, Madrid, Spain
[5] Univ Autonoma Madrid, Escuela Politecn Super, C Francisco Tomas & Valiente 11,Campus Cantoblanco, Madrid 28049, Spain
Keywords
Document anonymization; Information extraction; Named Entity Recognition; Natural language processing; Visualization tools; Document sharing; De-identification; Set
DOI
10.1016/j.asoc.2023.110540
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Public institutions, such as law enforcement agencies or health centers, hold a vast volume of unstructured text documents, e.g. police reports. Currently, before this data can be shared (e.g. with research institutions), it must go through a lengthy and costly manual anonymization procedure. This paper addresses this issue by presenting AGORA, a cutting-edge tool that automatically identifies key entities and anonymizes sensitive data in text documents. AGORA has been developed in partnership with the Spanish National Office Against Hate Crimes and validated in the police and medical domains. The tool can export both anonymized texts and identified entities to structured files, thus simplifying their use for analysis purposes. AGORA can also plot the location entities identified in the documents, and obtain and display relevant information about their geographical surroundings. It thereby simplifies the task of generating comprehensive datasets for subsequent data analysis or predictive tasks. Its main goal is to foster cooperation between public institutions and research centers by easing document sharing, and to serve as a foundation for subsequent phases of data science work. The paper conducts a comprehensive review of the literature on Named Entity Recognition (NER) methodologies and technologies, followed by extensive computational experiments to identify the best configuration for the NER models embedded in AGORA, which include both successful state-of-the-art model setups and newly proposed ones. Finally, the methodology, conclusions and software provided can be easily reused in similar application scenarios. © 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Pages: 11