Automated classification of cancer morphology from Italian pathology reports using Natural Language Processing techniques: A rule-based approach

Cited by: 19
Authors
Hammami, Linda [1]
Paglialonga, Alessia [2]
Pruneri, Giancarlo [3, 4]
Torresani, Michele [5]
Sant, Milena [1]
Bono, Carlo [6]
Caiani, Enrico Gianluca [2, 7]
Baili, Paolo [1]
Affiliations
[1] Fdn IRCCS Ist Nazl Tumori, Analyt Epidemiol & Hlth Impact Unit, Via Venezian 1, I-20133 Milan, Italy
[2] Natl Res Council Italy CNR, Inst Elect Comp & Telecommun Engn IEIIT, Milan, Italy
[3] Fdn IRCCS Ist Nazl Tumori, Pathol Dept, Milan, Italy
[4] Univ Milan, Sch Med, Milan, Italy
[5] Fdn IRCCS Ist Nazl Tumori, Hlth Direct, Milan, Italy
[6] Fdn IRCCS Ist Nazl Tumori, Milan, Italy
[7] Politecn Milan, Elect Informat & Biomed Engn Dept, Milan, Italy
Keywords
Natural Language Processing; Italian language; Pathology reports; Cancer morphology
DOI
10.1016/j.jbi.2021.103712
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Pathology reports represent a primary source of information for cancer registries. Hospitals routinely process high volumes of free-text reports, a valuable source of information regarding cancer diagnosis for improving clinical care and supporting research. Information extraction and coding of unstructured textual data is typically a manual, labour-intensive process. There is a need to develop automated approaches to extract meaningful information from such texts in a reliable and accurate way. In this scenario, Natural Language Processing (NLP) algorithms offer a unique opportunity to automatically encode the unstructured reports into structured data, thus representing a potentially powerful alternative to expensive manual processing. However, notwithstanding the increasing interest in this area, NLP approaches for pathology reports in languages other than English, including Italian, remain scarce. The aim of our work was to develop an automated algorithm based on NLP techniques, able to identify and classify the morphological content of pathology reports in the Italian language with micro-averaged performance scores higher than 95%. Specifically, a novel, domain-specific classifier that uses linguistic rules was developed and tested on 27,239 pathology reports from a single Italian oncological centre, following the International Classification of Diseases for Oncology morphology classification standard (ICD-O-M). The proposed classification algorithm achieved a micro-F1 score of 98.14% on 9,594 pathology reports in the test dataset. This algorithm relies on rules defined on data from a single hospital that is specifically dedicated to cancer, but it is based on general processing steps which can be applied to different datasets. Further research will be important to demonstrate the generalizability of the proposed approach on a larger corpus from different hospitals.
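To make the evaluation metric in the abstract concrete, the following is a minimal sketch of how a rule-based labeller and a micro-averaged F1 score might fit together. It is not the authors' algorithm: the three regular-expression rules, the function names `classify` and `micro_f1`, and the sample reports are all hypothetical and chosen only for illustration. The sketch assumes each report has exactly one gold ICD-O-M code and that the classifier may abstain (return `None`) when no rule matches.

```python
import re

# Hypothetical keyword rules mapping Italian morphology terms to ICD-O-M codes.
# Illustrative only; the rule set used in the paper is not reproduced here.
RULES = [
    (re.compile(r"carcinoma\s+duttale\s+infiltrante", re.IGNORECASE), "8500/3"),
    (re.compile(r"carcinoma\s+squamocellulare", re.IGNORECASE), "8070/3"),
    (re.compile(r"adenocarcinoma", re.IGNORECASE), "8140/3"),
]


def classify(report):
    """Return the code of the first matching rule, or None if no rule fires."""
    for pattern, code in RULES:
        if pattern.search(report):
            return code
    return None


def micro_f1(gold, predicted):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p == g:
            tp += 1
        else:
            fn += 1          # the gold code was missed
            if p is not None:
                fp += 1      # a wrong code was asserted
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample reports and gold codes.
reports = [
    "Carcinoma duttale infiltrante della mammella",
    "Adenocarcinoma del colon",
    "Melanoma cutaneo",
]
gold = ["8500/3", "8140/3", "8720/3"]
predicted = [classify(r) for r in reports]
score = micro_f1(gold, predicted)
```

In this toy run two reports are labelled correctly and the third gets no label, so precision is 1, recall is 2/3, and the micro-F1 is 0.8. Pooling counts across classes means frequent morphology codes dominate the score, which is worth keeping in mind when reading a single micro-averaged figure.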
Pages: 7