An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

Cited by: 3
Authors
Carnaz, Goncalo [1]
Antunes, Mario [2, 3]
Nogueira, Vitor Beires [1]
Affiliations
[1] Univ Evora, Dept Informat, P-7002554 Evora, Portugal
[2] Polytech Leiria, Sch Technol & Management, Comp Sci & Commun Res Ctr CIIC, P-2411901 Leiria, Portugal
[3] CRACS, INESC TEC, P-4200465 Porto, Portugal
Keywords
crime-related documents; cybersecurity; criminal investigation; Portuguese language corpus; natural language processing; 5W1H
DOI
10.3390/data6070071
CLC number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. Criminal investigation is a multidisciplinary applied science that includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities mentioned in the investigation. Computerized processing of these documents supports criminal investigation because it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide range of dedicated tools exists, but they share a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus composed of a collection of anonymized crime-related documents extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained for the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools that detect and correlate entities in documents. Example tasks are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
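The mean precision, recall, and F1-score quoted in the abstract can be illustrated with a minimal Python sketch. It assumes exact-match scoring of entity spans and macro-averaging over entity types; the record does not state how the means were computed, so both are assumptions. The spans and labels (`PERSON`, `LOCATION`, `LICENSE_PLATE`) and the function names are hypothetical and are not taken from the corpus. One observation that is consistent with, but does not confirm, per-class averaging: the harmonic mean of 0.808 and 0.722 is about 0.763, higher than the reported F1-score of 0.733.

```python
def per_label_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} using exact span matching.

    Each span is a (start, end, label) tuple. This is an illustrative
    scoring scheme, not necessarily the one used by the paper.
    """
    labels = {span[2] for span in gold | predicted}
    scores = {}
    for label in sorted(labels):
        g = {s for s in gold if s[2] == label}
        p = {s for s in predicted if s[2] == label}
        tp = len(g & p)  # spans that match in boundaries and label
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall, and F1 across labels."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Hypothetical annotations: (start_token, end_token, label).
gold = {
    (0, 2, "PERSON"),
    (5, 6, "PERSON"),
    (10, 12, "LOCATION"),
    (15, 16, "LICENSE_PLATE"),
}
predicted = {
    (0, 2, "PERSON"),
    (10, 12, "LOCATION"),
    (20, 21, "LOCATION"),  # spurious prediction
    (15, 16, "LICENSE_PLATE"),
}

scores = per_label_scores(gold, predicted)
mean_p, mean_r, mean_f1 = macro_average(scores)
harmonic_of_means = 2 * mean_p * mean_r / (mean_p + mean_r)

print(f"macro P={mean_p:.3f} R={mean_r:.3f} F1={mean_f1:.3f}")
print(f"harmonic mean of macro P and R={harmonic_of_means:.3f}")
```

In this toy case the macro F1-score is lower than the harmonic mean of the macro precision and recall, which shows why a mean F1-score cannot in general be recomputed from the mean precision and mean recall alone.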
Pages: 11