Machine learning classification of surgical pathology reports and chunk recognition for information extraction noise reduction

Cited by: 25
Authors
Napolitano, Giulio [1]
Marshall, Adele [2]
Hamilton, Peter [3]
Gavin, Anna T. [4]
Affiliations
[1] Univ Bonn, IMBIE, Haus 325-11-1-OG Raum 620, Sigmund Freud Str 25, D-53105 Bonn, Germany
[2] Queens Univ Belfast, Sch Math & Phys, Univ Rd, Belfast BT7 1NN, Antrim, Northern Ireland
[3] Queens Univ Belfast, Sch Med Dent & Biomed Sci, 97 Lisburn Rd, Belfast BT9 7BL, Antrim, Northern Ireland
[4] Queens Univ Belfast, NICR Ctr Publ Hlth, Mulhouse Bldg, Grosvenor Rd, Belfast BT12 6DP, Antrim, Northern Ireland
Keywords
Natural language processing; Information extraction; Supervised machine learning; Surgical pathology report; Cancer staging; Medication information
DOI
10.1016/j.artmed.2016.06.001
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Background and aims: Machine learning techniques for the text mining of cancer-related clinical documents have not been sufficiently explored. Here some techniques are presented for the pre-processing of free-text breast cancer pathology reports, with the aim of facilitating the extraction of information relevant to cancer staging.
Materials and methods: The first technique was implemented using the freely available software RapidMiner to classify the reports according to their general layout: 'semi-structured' or 'unstructured'. The second technique was developed using the open source language engineering framework GATE and aimed at predicting which chunks of the report text contain information on the cancer morphology, the tumour size, its hormone receptor status and the number of positive nodes. The classifiers were trained and tested on sets of 635 and 163 reports respectively, manually classified or annotated, from the Northern Ireland Cancer Registry.
Results: The best result, 99.4% accuracy (with only one semi-structured report predicted as unstructured), was produced by the layout classifier with the k-nearest neighbour algorithm, using the binary term occurrence word vector type with stopword filter and pruning. For chunk recognition, the best results were found using the PAUM algorithm with the same parameters for all cases, except for the prediction of chunks containing cancer morphology. For semi-structured reports, precision ranged from 0.97 to 0.94 and recall from 0.92 to 0.83, while for unstructured reports precision ranged from 0.91 to 0.64 and recall from 0.68 to 0.41. Poor results were found when the classifier was trained on semi-structured reports but tested on unstructured ones.
Conclusions: These results show that it is possible and beneficial to predict the layout of reports, and that the accuracy with which the segments of a report likely to contain certain information can be predicted is sensitive to the report layout and the type of information sought. (C) 2016 Elsevier B.V. All rights reserved.
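Illustrative sketch (not the authors' implementation): the layout classifier described in the abstract combines binary term occurrence vectors, a stopword filter, pruning and a k-nearest neighbour vote. The authors built theirs in RapidMiner; the plain Python below only mirrors the idea. The stopword list, the toy reports, the pruning threshold min_docs, the choice of Jaccard similarity and k=3 are all assumptions made for illustration and are not taken from the paper.

```python
import re
from collections import Counter

# Hypothetical stopword list; the list used in the paper is not stated in the abstract.
STOPWORDS = {"the", "of", "a", "an", "and", "to", "is", "which", "in"}

# Invented toy reports, for illustration only (not registry data).
TRAIN = [
    ("Specimen: left breast. Tumour size: 22 mm. Grade: 2. Nodes positive: 1", "semi-structured"),
    ("Specimen: right breast. Tumour size: 15 mm. Grade: 3. Nodes positive: 0", "semi-structured"),
    ("Specimen: left breast. Tumour size: 9 mm. Grade: 1. Nodes positive: 2", "semi-structured"),
    ("Sections show an invasive ductal carcinoma measuring approximately 22 mm across", "unstructured"),
    ("Sections show an infiltrating carcinoma which extends close to the margin", "unstructured"),
    ("The sections show invasive carcinoma and three of the nodes contain metastatic tumour", "unstructured"),
]


def tokens(text):
    """Lower-case alphabetic tokens with stopwords filtered out."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def build_vocabulary(docs, min_docs=2):
    """Pruning: keep only terms that occur in at least min_docs documents."""
    doc_freq = Counter(term for doc in docs for term in tokens(doc))
    return {term for term, n in doc_freq.items() if n >= min_docs}


def vectorize(text, vocab):
    """Binary term occurrence, stored as the set of vocabulary terms present."""
    return tokens(text) & vocab


def jaccard(a, b):
    """Similarity between two binary vectors (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_predict(text, train, vocab, k=3):
    """Majority vote over the k training reports most similar to text."""
    query = vectorize(text, vocab)
    ranked = sorted(
        train,
        key=lambda item: jaccard(query, vectorize(item[0], vocab)),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


VOCAB = build_vocabulary([text for text, _ in TRAIN])

print(knn_predict("Specimen: right breast. Tumour size: 30 mm. Grade: 2. Nodes positive: 4", TRAIN, VOCAB))
print(knn_predict("Sections show a carcinoma of the breast with invasive growth", TRAIN, VOCAB))
```

Storing each binary vector as a set keeps the sketch short; a real pipeline would more likely use sparse matrices and a distance measure tuned on held-out reports.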
Pages: 77-83
Number of pages: 7