Fiscal data in text: Information extraction from audit reports using Natural Language Processing

Author
Beltran, Alejandro [1]
Affiliation
[1] Alan Turing Inst, London, England
Source
DATA & POLICY | 2023 / Volume 5
Keywords
auditing; corruption; malfeasance; natural language processing; subnational governments; text-as-data
DOI
10.1017/dap.2023.4
Chinese Library Classification
C93 [Management]; D035 [National Administration]; D523 [Administration]; D63 [National Administration]
Discipline classification codes
12; 1201; 1202; 120202; 1204; 120401
Abstract
Supreme audit institutions (SAIs) are touted as an integral component of anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies and missing resources, and may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs, because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending, potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.

Policy Significance Statement
Annual audits by supreme audit institutions produce important information on the health and accuracy of governmental budgets. These reports include the monetary value of discrepancies, missing funds, and corrupt actions. This paper offers a strategy for collecting that information from historical audit reports and creating a database on budgetary discrepancies. It uses machine learning and natural language processing to accelerate and scale the collection of data to thousands of paragraphs. The granularity of the budgetary data obtained through this approach is useful to reformers and policymakers who require detailed data on municipal finances. This approach can also be applied to other countries that publish audit reports as PDF documents, across different languages and contexts.
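The filter-then-extract shape of the pipeline described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: it assumes the text has already been through optical character recognition, it substitutes a hypothetical keyword list (`CUE_WORDS`) for the trained paragraph classifier, and it substitutes a regular expression (`AMOUNT_RE`) for the named entity recognizer. All function names and the sample cue words are assumptions made for illustration.

```python
import re
from decimal import Decimal

# Hypothetical cue words. The paper trains a classification model instead;
# a fixed keyword list is only a stand-in to show where that step sits.
CUE_WORDS = ("observación", "irregularidad", "faltante", "discrepancia")

# Stand-in for the named entity recognizer: matches amounts written like
# "$1,234,567.89", "$ 950.00" or "$12000".
AMOUNT_RE = re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d+)?)")


def is_relevant(paragraph: str) -> bool:
    """Return True if the paragraph contains any cue word (case-insensitive)."""
    text = paragraph.lower()
    return any(word in text for word in CUE_WORDS)


def extract_amounts(paragraph: str) -> list[Decimal]:
    """Return every monetary amount found in the paragraph as a Decimal."""
    return [Decimal(m.replace(",", "")) for m in AMOUNT_RE.findall(paragraph)]


def run_pipeline(paragraphs: list[str]) -> list[tuple[int, Decimal]]:
    """Keep relevant paragraphs, then emit (paragraph index, amount) rows."""
    rows = []
    for index, paragraph in enumerate(paragraphs):
        if is_relevant(paragraph):
            for amount in extract_amounts(paragraph):
                rows.append((index, amount))
    return rows


if __name__ == "__main__":
    sample = [
        "Se detectó una irregularidad por $1,234,567.89 en obra pública.",
        "El municipio presentó su informe anual por $500.00.",
        "Faltante de $ 950.00 y observación por $12,000.",
    ]
    for index, amount in run_pipeline(sample):
        print(index, amount)
```

Filtering before extraction matters because audit reports mention many amounts that are not discrepancies, such as the budget total in the second sample paragraph. A learned classifier and entity recognizer would presumably handle wording and OCR noise that fixed keywords and a regular expression cannot.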
Pages: 13