Fiscal data in text: Information extraction from audit reports using Natural Language Processing

Authors
Beltran, Alejandro [1]
Affiliation
[1] Alan Turing Inst, London, England
Source
DATA & POLICY | 2023 / Vol. 5
Keywords
auditing; corruption; natural language processing; subnational governments; text-as-data; CORRUPTION; MALFEASANCE;
DOI
10.1017/dap.2023.4
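Chinese Library Classification
C93 [Management Science]; D035 [National Administration]; D523 [Administration]; D63 [National Administration];
Discipline classification codes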
12 ; 1201 ; 1202 ; 120202 ; 1204 ; 120401 ;
Abstract
Supreme audit institutions (SAIs) are touted as an integral component of anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies and missing resources, and may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending, potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.
Policy Significance Statement
Annual audits by supreme audit institutions produce important information on the health and accuracy of governmental budgets. These reports include the monetary value of discrepancies, missing funds, and corrupt actions. This paper offers a strategy for collecting that information from historical audit reports and creating a database on budgetary discrepancies. It uses machine learning and natural language processing to accelerate and scale the collection of data to thousands of paragraphs. The granularity of the budgetary data obtained through this approach is useful to reformers and policymakers who require detailed data on municipal finances. This approach can also be applied to other countries that publish audit reports as PDF documents, across different languages and contexts.
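The last two stages described in the abstract (selecting relevant paragraphs, then extracting monetary values) can be illustrated with a minimal sketch. It is not the paper's implementation: the paper trains a classification model and a named entity recognizer, whereas this sketch swaps in a keyword filter and a regular expression as simple stand-ins. It assumes OCR has already produced plain-text paragraphs, and the keyword list, the function names, and the sample sentences are all hypothetical.

```python
import re
from decimal import Decimal

# Hypothetical keyword list; a stand-in for the trained paragraph classifier.
RELEVANT_TERMS = ("observación", "importe", "diferencia", "faltante")

# Stand-in for the named entity recognizer: matches amounts such as
# "$1,234,567.89", "$ 950.00", or "$12345.50".
AMOUNT_RE = re.compile(
    r"\$\s?(\d{1,3}(?:,\d{3})+(?:\.\d{2})?|\d+(?:\.\d{2})?)"
)


def is_relevant(paragraph):
    """Return True if the paragraph mentions any of the assumed keywords."""
    text = paragraph.lower()
    return any(term in text for term in RELEVANT_TERMS)


def extract_amounts(paragraph):
    """Return every monetary amount in the paragraph as a Decimal."""
    return [Decimal(m.replace(",", "")) for m in AMOUNT_RE.findall(paragraph)]


def run_pipeline(paragraphs):
    """Keep relevant paragraphs and pair each with its extracted amounts."""
    return [(p, extract_amounts(p)) for p in paragraphs if is_relevant(p)]


if __name__ == "__main__":
    # Invented sample text, not taken from any audit report.
    sample = [
        "Se determinó una observación por un importe de $1,234,567.89 "
        "sin documentación comprobatoria.",
        "El municipio presentó su cuenta pública en tiempo y forma.",
        "Existe una diferencia de $ 950.00 entre el registro contable "
        "y el estado de cuenta.",
    ]
    for paragraph, amounts in run_pipeline(sample):
        print(amounts, "<-", paragraph[:50])
```

Amounts are parsed as `Decimal` rather than `float` so that currency values are not subject to binary rounding. A rule-based extractor like this one would likely miss amounts written out in words or damaged by OCR, which is one reason a trained recognizer may be preferable on scanned documents.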
Pages: 13
Related papers
50 in total
  • [21] ACADEMIC TEXT CLUSTERING USING NATURAL LANGUAGE PROCESSING
    Taskiran, Salimkan Fatma
    Kaya, Ersin
    KONYA JOURNAL OF ENGINEERING SCIENCES, 2022, 10 : 41 - 51
  • [22] Integrated natural language processing method for text mining and visualization of underground engineering text reports
    Shao, Ruiqi
    Lin, Peng
    Xu, Zhenhao
    AUTOMATION IN CONSTRUCTION, 2024, 166
  • [23] The application of natural language processing for the extraction of mechanistic information in toxicology
    Corradi, Marie
    Luechtefeld, Thomas
    de Haan, Alyanne M.
    Pieters, Raymond
    Freedman, Jonathan H.
    Vanhaecke, Tamara
    Vinken, Mathieu
    Teunis, Marc
    FRONTIERS IN TOXICOLOGY, 2024, 6
  • [24] Causal Discovery from Natural Language Text using Context and Dependency Information
    Mitra, Shania
    Tangirala, Arun K.
    2022 61ST ANNUAL CONFERENCE OF THE SOCIETY OF INSTRUMENT AND CONTROL ENGINEERS (SICE), 2022, : 236 - 241
  • [25] Extraction of sleep information from clinical notes of Alzheimer's disease patients using natural language processing
    Sivarajkumar, Sonish
    Tam, Thomas Yu Chow
    Mohammad, Haneef Ahamed
    Viggiano, Samuel
    Oniani, David
    Visweswaran, Shyam
    Wang, Yanshan
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2024, 31 (10) : 2217 - 2227
  • [26] Information Extraction from Cancer Pathology Reports with Graph Convolution Networks for Natural Language Texts
    Yoon, Hong-Jun
    Gounley, John
    Young, M. Todd
    Tourassi, Georgia
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2019, : 4561 - 4564
  • [27] Biomolecular Event Extraction using Natural Language Processing
    Bali, Manish
    Anandaraj, S. P.
    INTERNATIONAL JOURNAL OF ELECTRICAL AND COMPUTER ENGINEERING SYSTEMS, 2023, 14 (05) : 601 - 612
  • [28] Rules based Event Extraction from Natural language Text
    Guda, Vanitha
    Sanampudi, Suresh Kumar
    2016 IEEE INTERNATIONAL CONFERENCE ON RECENT TRENDS IN ELECTRONICS, INFORMATION & COMMUNICATION TECHNOLOGY (RTEICT), 2016, : 9 - 13
  • [29] Applications of Natural Language Processing to Geoscience Text Data and Prospectivity Modeling
    Christopher J. M. Lawley
    Michael G. Gadd
    Mohammad Parsa
    Graham W. Lederer
    Garth E. Graham
    Arianne Ford
    Natural Resources Research, 2023, 32 : 1503 - 1527
  • [30] Using natural language processing to extract structured epilepsy data from unstructured clinic letters: development and validation of the ExECT (extraction of epilepsy clinical text) system
    Fonferko-Shadrach, Beata
    Lacey, Arron S.
    Roberts, Angus
    Akbari, Ashley
    Thompson, Simon
    Ford, David V.
    Lyons, Ronan A.
    Rees, Mark I.
    Pickrell, William Owen
    BMJ OPEN, 2019, 9 (04):