Fiscal data in text: Information extraction from audit reports using Natural Language Processing

Cited by: 0
Authors
Beltran, Alejandro [1]
Affiliations
[1] Alan Turing Inst, London, England
Source
DATA & POLICY | 2023 / Vol. 5
Keywords
auditing; corruption; natural language processing; subnational governments; text-as-data; CORRUPTION; MALFEASANCE;
DOI
10.1017/dap.2023.4
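Chinese Library Classification
C93 [Management Science]; D035 [National Administrative Management]; D523 [Administrative Management]; D63 [National Administrative Management];
Subject classification codes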
12 ; 1201 ; 1202 ; 120202 ; 1204 ; 120401 ;
Abstract
Supreme audit institutions (SAIs) are touted as an integral component of anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies and missing resources, and may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending, potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.

Policy Significance Statement
Annual audits by supreme audit institutions produce important information on the health and accuracy of governmental budgets. These reports include the monetary value of discrepancies, missing funds, and corrupt actions. This paper offers a strategy for collecting that information from historical audit reports and creating a database on budgetary discrepancies. It uses machine learning and natural language processing to accelerate and scale the collection of data to thousands of paragraphs. The granularity of the budgetary data obtained through this approach is useful to reformers and policymakers who require detailed data on municipal finances. This approach can also be applied to other countries that publish audit reports in PDF documents across different languages and contexts.
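Illustrative sketch (not the author's implementation): the abstract describes a three-stage pipeline of OCR, paragraph classification, and named entity recognition of monetary values. The Python example below mimics only the last two stages on text that has already been OCR'd. It swaps in much simpler techniques than the paper describes: a keyword filter stands in for the trained classifier, and a regular expression stands in for the named entity recognizer. The keyword list, the function names, and the sample sentences are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import re
from decimal import Decimal

# Hypothetical Spanish keywords; a stand-in for the trained paragraph classifier.
KEYWORDS = ("observación", "importe", "irregularidad", "faltante")

# Matches a "$" sign followed by digits with optional thousands commas
# and an optional decimal part, e.g. "$1,234,567.89" or "$ 500.00".
AMOUNT_RE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d+)?)")


def is_relevant(paragraph: str) -> bool:
    """Return True if the paragraph contains any of the assumed keywords."""
    text = paragraph.lower()
    return any(keyword in text for keyword in KEYWORDS)


def extract_amounts(paragraph: str) -> list[Decimal]:
    """Return every monetary amount found in the paragraph as a Decimal."""
    return [Decimal(m.replace(",", "")) for m in AMOUNT_RE.findall(paragraph)]


def run_pipeline(paragraphs: list[str]) -> list[tuple[int, Decimal]]:
    """Filter paragraphs, then collect (paragraph index, amount) pairs."""
    rows = []
    for index, paragraph in enumerate(paragraphs):
        if is_relevant(paragraph):
            for amount in extract_amounts(paragraph):
                rows.append((index, amount))
    return rows


if __name__ == "__main__":
    # Made-up sample paragraphs, not taken from any audit report.
    sample = [
        "Se detectó un importe de $1,234,567.89 sin comprobar.",
        "El municipio publicó su informe anual.",
        "Irregularidad por $500.00 y $12,000.50 en obra pública.",
    ]
    for index, amount in run_pipeline(sample):
        print(index, amount)
```

Amounts are parsed into Decimal rather than float so that currency values are not subject to binary rounding. A rule-based extractor like this would likely miss amounts written out in words or damaged by OCR errors, which is one reason a learned classifier and entity recognizer, as described in the abstract, may be preferable on real scanned reports.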
Pages: 13