A Text Mining Approach in the Classification of Free-Text Cancer Pathology Reports from the South African National Health Laboratory Services

Cited by: 6
Authors
Achilonu, Okechinyere J. [1]
Olago, Victor [2]
Singh, Elvira [1, 2]
Eijkemans, Rene M. J. C. [3]
Nimako, Gideon [1, 4]
Musenge, Eustasius [1]
Affiliations
[1] Univ Witwatersrand, Fac Hlth Sci, Sch Publ Hlth, Div Epidemiol & Biostat, ZA-2000 Johannesburg, South Africa
[2] Natl Hlth Lab Serv, Natl Canc Registry, 1 Modderfontein Rd, ZA-2131 Johannesburg, South Africa
[3] Univ Utrecht, Univ Med Ctr, Julius Ctr Hlth Sci & Primary Care, NL-3584 Utrecht, Netherlands
[4] African Union Dev Agcy AUDA NEPAD, Industrializat Sci Technol & Innovat Hub, ZA-1685 Johannesburg, South Africa
Funding
Wellcome Trust (UK);
Keywords
pathology reports; breast; colorectal; prostate; text mining; machine learning; support vector machine and random forest; QUALITY;
DOI
10.3390/info12110451
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
A cancer pathology report is a valuable medical document that provides information for clinical management of the patient and evaluation of health care. However, there are variations in the quality of reporting in free-text style formats, ranging from comprehensive to incomplete reporting. Moreover, the increasing incidence of cancer has generated a high throughput of pathology reports. Hence, manual extraction and classification of information from these reports can be intrinsically complex and resource-intensive. This study aimed to (i) evaluate the quality of over 80,000 breast, colorectal, and prostate cancer free-text pathology reports and (ii) assess the effectiveness of random forest (RF) and variants of support vector machine (SVM) in the classification of reports into benign and malignant classes. The study approach comprises data preprocessing, visualisation, feature selection, text classification, and evaluation of performance metrics. The performance of the classifiers was evaluated across various feature sizes, which were jointly selected by four filter feature selection methods. The feature selection methods identified established clinical terms, which are synonymous with each of the three cancers. Uni-gram tokenisation using the classifiers showed that the predictive power of the RF model was consistent across various feature sizes, with overall F-scores of 95.2%, 94.0%, and 95.3% for breast, colorectal, and prostate cancer classification, respectively. The radial SVM achieved better classification performance compared with its linear variant for most of the feature sizes. The classifiers also achieved high precision, recall, and accuracy. This study supports a nationally agreed standard in pathology reporting and the use of text mining for encoding, classifying, and producing high-quality information abstractions for cancer prognosis and research.
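The filter feature selection and F-score evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the four filter methods, so the chi-square statistic used here is an assumption, and the four toy reports, their labels, and all function names are hypothetical and only serve the illustration. This is not the authors' implementation.

```python
import re
from collections import Counter


def unigrams(text):
    """Lowercase a report and split it into single-word (uni-gram) tokens."""
    return re.findall(r"[a-z]+", text.lower())


def chi_square_scores(docs, labels):
    """Score each term with a 2x2 chi-square statistic against a binary label.

    labels: 1 = malignant, 0 = benign. Chi-square is an assumed stand-in
    for the (unnamed) filter methods in the abstract.
    """
    n = len(docs)
    n_pos = sum(labels)
    df_all, df_pos = Counter(), Counter()
    for text, y in zip(docs, labels):
        for term in set(unigrams(text)):
            df_all[term] += 1
            df_pos[term] += y
    scores = {}
    for term, present in df_all.items():
        a = df_pos[term]       # malignant reports containing the term
        b = present - a        # benign reports containing the term
        c = n_pos - a          # malignant reports without the term
        d = n - n_pos - b      # benign reports without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F-score for the positive (malignant) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy reports, invented for illustration only.
toy_reports = [
    "Invasive ductal carcinoma identified",
    "Benign fibroadenoma, no malignancy",
    "Infiltrating carcinoma, grade two",
    "Benign fibrocystic change",
]
toy_labels = [1, 0, 1, 0]

# Rank terms by score; the top-k terms would form the feature set.
ranked = sorted(chi_square_scores(toy_reports, toy_labels).items(),
                key=lambda kv: kv[1], reverse=True)
```

In a fuller pipeline, the top-ranked terms would be passed as features to a classifier such as RF or SVM, and `precision_recall_f1` would be applied to its held-out predictions.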
Pages: 22
Related papers
35 in total
  • [21] Prioritising national healthcare service issues from free text feedback - A computational text analysis & predictive modelling approach
    Ojo, Adegboyega
    Rizun, Nina
    Walsh, Grace
    Mashinchi, Mona Isazad
    Venosa, Maria
    Rao, Manohar Narayana
    DECISION SUPPORT SYSTEMS, 2024, 181
  • [22] Deployment of a Free-Text Analytics Platform at a UK National Health Service Research Hospital: CogStack at University College London Hospitals
    Noor, Kawsar
    Roguski, Lukasz
    Bai, Xi
    Handy, Alex
    Klapaukh, Roman
    Folarin, Amos
    Romao, Luis
    Matteson, Joshua
    Lea, Nathan
    Zhu, Leilei
    Asselbergs, Folkert W.
    Wong, Wai Keong
    Shah, Anoop
    Dobson, Richard J. B.
    JMIR MEDICAL INFORMATICS, 2022, 10 (08)
  • [23] An improved text mining approach to extract safety risk factors from construction accident reports
    Xu, Na
    Ma, Ling
    Liu, Qing
    Wang, Li
    Deng, Yongliang
    SAFETY SCIENCE, 2021, 138
  • [24] A robust classification approach to enhance clinic identification from Arabic health text
    Al-Fuqaha'a, Shrouq
    Al-Madi, Nailah
    Hammo, Bassam
    NEURAL COMPUTING & APPLICATIONS, 2024, 36 (13) : 7161 - 7185
  • [27] Using text mining techniques to extract prostate cancer predictive information (Gleason score) from semi-structured narrative laboratory reports in the Gauteng province, South Africa
    Cassim, Naseem
    Mapundu, Michael
    Olago, Victor
    Celik, Turgay
    George, Jaya Anna
    Glencross, Deborah Kim
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2021, 21 (01)
  • [28] Selection of diagnosis with oncologic relevance information from histopathology free text reports: A machine learning approach
    Viscosi, Carmelo
    Fidelbo, Paolo
    Benedetto, Andrea
    Varvara, Massimo
    Ferrante, Margherita
    INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, 2022, 160
  • [29] Natural Language Processing Algorithm Used for Staging Pulmonary Oncology from Free-Text Radiological Reports: "Including PET-CT and Validation Towards Clinical Use"
    Nobel, J. Martijn
    Puts, Sander
    Krdzalic, Jasenko
    Zegers, Karen M. L.
    Lobbes, Marc B. I.
    Robben, Simon G. F.
    Dekker, Andre L. A. J.
    JOURNAL OF IMAGING INFORMATICS IN MEDICINE, 2024, 37 (01) : 3 - 12
  • [30] Intelligent information extraction from government on-site inspection reports of construction projects: A graph-based text mining approach
    Liu, Muyang
    Luo, Xiaowei
    Wang, Guangbin
    Lu, Wei-Zhen
    ADVANCED ENGINEERING INFORMATICS, 2023, 58