Fiscal data in text: Information extraction from audit reports using Natural Language Processing

Cited by: 0
Authors
Beltran, Alejandro [1]
Affiliation
[1] Alan Turing Inst, London, England
Source
DATA & POLICY | 2023 / Vol. 5
Keywords
auditing; corruption; malfeasance; natural language processing; subnational governments; text-as-data
DOI
10.1017/dap.2023.4
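Chinese Library Classification
C93 [Management Science]; D035 [State Administration]; D523 [Administrative Management]; D63 [State Administration];
Discipline classification code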
12 ; 1201 ; 1202 ; 120202 ; 1204 ; 120401 ;
Abstract
Supreme audit institutions (SAIs) are touted as an integral component to anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies, missing resources, or may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending, potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.

Policy Significance Statement
Annual audits by supreme audit institutions produce important information on the health and accuracy of governmental budgets. These reports include the monetary value of discrepancies, missing funds, and corrupt actions. This paper offers a strategy for collecting that information from historical audit reports and creating a database on budgetary discrepancies. It uses machine learning and natural language processing to accelerate and scale the collection of data to thousands of paragraphs. The granularity of the budgetary data obtained through this approach is useful to reformers and policymakers who require detailed data on municipal finances. This approach can also be applied to other countries that publish audit reports in PDF documents across different languages and contexts.
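The abstract describes a three-stage pipeline: OCR, a paragraph classifier, and a named entity recognizer for monetary values. This record does not include the paper's code, so the following is only a minimal sketch of the last two stages under stated assumptions. It takes text that has already been through OCR, substitutes a keyword rule for the trained classifier, and substitutes a regular expression for the named entity recognizer. The cue words, function names, and sample paragraphs are hypothetical and are not taken from the paper.

```python
import re
from decimal import Decimal
from typing import List, Tuple

# Hypothetical Spanish cue words standing in for a trained paragraph classifier.
CUE_WORDS = ("observación", "faltante", "irregularidad", "sin comprobar")

# Peso-style amounts: "$1,234,567.89", "$ 500.00", or "$12000".
# The first alternative requires at least one thousands group so that
# unformatted numbers fall through to the plain-digits alternative.
AMOUNT_RE = re.compile(
    r"\$\s?(\d{1,3}(?:,\d{3})+(?:\.\d{1,2})?|\d+(?:\.\d{1,2})?)"
)


def is_relevant(paragraph: str) -> bool:
    """Keyword rule (assumption): flag paragraphs that mention a discrepancy."""
    text = paragraph.lower()
    return any(cue in text for cue in CUE_WORDS)


def extract_amounts(paragraph: str) -> List[Decimal]:
    """Regex stand-in for the NER step: return every monetary value found."""
    return [Decimal(m.replace(",", "")) for m in AMOUNT_RE.findall(paragraph)]


def run_pipeline(paragraphs: List[str]) -> List[Tuple[int, Decimal]]:
    """Return (paragraph index, amount) pairs for relevant paragraphs only."""
    rows = []
    for index, paragraph in enumerate(paragraphs):
        if is_relevant(paragraph):
            for amount in extract_amounts(paragraph):
                rows.append((index, amount))
    return rows


# Invented sample paragraphs, for illustration only.
SAMPLE = [
    "Se detectó un faltante de $1,234,567.89 en el ejercicio fiscal.",
    "El municipio presentó su cuenta pública en tiempo.",
    "Observación: pagos sin comprobar por $ 500.00 y $12000.",
    "El presupuesto aprobado fue de $9,000.00.",
]

if __name__ == "__main__":
    for index, amount in run_pipeline(SAMPLE):
        print(index, amount)
```

Filtering before extraction matters because audit reports also quote amounts that are not discrepancies, such as approved budgets; in the sample, the last paragraph contains an amount but is skipped because it carries no discrepancy cue.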
Pages: 13