A Text Mining Approach in the Classification of Free-Text Cancer Pathology Reports from the South African National Health Laboratory Services

Cited by: 6
Authors
Achilonu, Okechinyere J. [1]
Olago, Victor [2]
Singh, Elvira [1,2]
Eijkemans, Rene M. J. C. [3]
Nimako, Gideon [1,4]
Musenge, Eustasius [1]
Affiliations
[1] Univ Witwatersrand, Fac Hlth Sci, Sch Publ Hlth, Div Epidemiol & Biostat, ZA-2000 Johannesburg, South Africa
[2] Natl Hlth Lab Serv, Natl Canc Registry, 1 Modderfontein Rd, ZA-2131 Johannesburg, South Africa
[3] Univ Utrecht, Univ Med Ctr, Julius Ctr Hlth Sci & Primary Care, NL-3584 Utrecht, Netherlands
[4] African Union Dev Agcy AUDA NEPAD, Industrializat Sci Technol & Innovat Hub, ZA-1685 Johannesburg, South Africa
Funding
Wellcome Trust (UK)
Keywords
pathology reports; breast; colorectal; prostate; text mining; machine learning; support vector machine and random forest; quality
DOI
10.3390/info12110451
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
A cancer pathology report is a valuable medical document that provides information for clinical management of the patient and evaluation of health care. However, the quality of reporting in free-text formats varies, ranging from comprehensive to incomplete. Moreover, the increasing incidence of cancer has generated a high throughput of pathology reports. Hence, manual extraction and classification of information from these reports can be intrinsically complex and resource-intensive. This study aimed to (i) evaluate the quality of over 80,000 breast, colorectal, and prostate cancer free-text pathology reports and (ii) assess the effectiveness of random forest (RF) and variants of support vector machine (SVM) in classifying reports into benign and malignant classes. The study approach comprises data preprocessing, visualisation, feature selection, text classification, and evaluation of performance metrics. The performance of the classifiers was evaluated across various feature sizes, which were jointly selected by four filter feature selection methods. The feature selection methods identified established clinical terms, which are synonymous with each of the three cancers. With uni-gram tokenisation, the predictive power of the RF model was consistent across various feature sizes, with overall F-scores of 95.2%, 94.0%, and 95.3% for breast, colorectal, and prostate cancer classification, respectively. The radial SVM achieved better classification performance than its linear variant for most of the feature sizes. The classifiers also achieved high precision, recall, and accuracy. This study supports a nationally agreed standard in pathology reporting and the use of text mining for encoding, classifying, and producing high-quality information abstractions for cancer prognosis and research.
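As a rough illustration of three steps the abstract names (uni-gram tokenisation, filter feature selection, and F-score evaluation), the sketch below may help. It is not the authors' pipeline: the abstract does not say which four filter methods were used, so the chi-squared score here is an assumption, the RF and SVM classifiers are left out, and the four toy reports and all function names are invented for illustration.

```python
import re
from collections import Counter


def unigrams(text):
    """Lowercase a report and split it into alphabetic uni-gram tokens."""
    return re.findall(r"[a-z]+", text.lower())


def chi2_scores(docs, labels):
    """Chi-squared score of each term's presence against a binary label.

    labels use 1 for malignant and 0 for benign (an assumed encoding).
    """
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    pos_df, neg_df = Counter(), Counter()
    for doc, y in zip(docs, labels):
        for term in set(unigrams(doc)):
            (pos_df if y else neg_df)[term] += 1
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df[term]   # malignant reports containing the term
        b = neg_df[term]   # benign reports containing the term
        c = n_pos - a      # malignant reports without the term
        d = n_neg - b      # benign reports without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def f_score(y_true, y_pred):
    """F1 score (harmonic mean of precision and recall) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Invented toy reports, not data from the study.
reports = [
    "invasive ductal carcinoma identified",
    "benign fibroadenoma no malignancy",
    "adenocarcinoma invasive margins involved",
    "benign tissue no atypia",
]
labels = [1, 0, 1, 0]

scores = chi2_scores(reports, labels)
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]
```

Terms that appear only in one class (here "invasive", "benign", and "no") receive the highest scores, which is the sense in which a filter method ranks features before any classifier is trained.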
Pages: 22