Text mining of cancer-related information: Review of current status and future directions

Cited by: 148
Authors
Spasic, Irena [1]
Livsey, Jacqueline [2]
Keane, John A. [3, 4, 5]
Nenadic, Goran [3, 4, 5]
Affiliations
[1] Cardiff Univ, Sch Comp Sci & Informat, Cardiff CF24 3AA, S Glam, Wales
[2] Christie NHS Fdn Trust, Clin Outcomes Unit, Manchester M20 4BX, Lancs, England
[3] Univ Manchester, Sch Comp Sci, Manchester M13 9PL, Lancs, England
[4] Hlth E Res Ctr, Manchester M13 9PL, Lancs, England
[5] Manchester Inst Biotechnol, Manchester M1 7DN, Lancs, England
Keywords
Cancer; Natural language processing; Data mining; Electronic medical records; OF-THE-ART; CLINICAL INFORMATION; GENE METHYLATION; DATABASE; EXTRACTION; SYSTEM; CLASSIFICATION; ONTOLOGY; TOOL; MEINFOTEXT
DOI
10.1016/j.ijmedinf.2014.06.009
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Purpose: This paper reviews the research literature on text mining (TM) with the aim to find out (1) which cancer domains have been the subject of TM efforts, (2) which knowledge resources can support TM of cancer-related information and (3) to what extent systems that rely on knowledge and computational methods can convert text data into useful clinical information. These questions were used to determine the current state of the art in this particular strand of TM and suggest future directions in TM development to support cancer research.
Methods: A review of the research on TM of cancer-related information was carried out. A literature search was conducted on the Medline database as well as IEEE Xplore and ACM digital libraries to address the interdisciplinary nature of such research. The search results were supplemented with the literature identified through Google Scholar.
Results: A range of studies have proven the feasibility of TM for extracting structured information from clinical narratives such as those found in pathology or radiology reports. In this article, we provide a critical overview of the current state of the art for TM related to cancer. The review highlighted a strong bias towards symbolic methods, e.g. named entity recognition (NER) based on dictionary lookup and information extraction (IE) relying on pattern matching. The F-measure of NER ranges between 80% and 90%, while that of IE for simple tasks is in the high 90s. To further improve the performance, TM approaches need to deal effectively with idiosyncrasies of the clinical sublanguage such as non-standard abbreviations as well as a high degree of spelling and grammatical errors. This requires a shift from rule-based methods to machine learning following the success of similar trends in biological applications of TM. Machine learning approaches require large training datasets, but clinical narratives are not readily available for TM research due to privacy and confidentiality concerns. This issue remains the main bottleneck for progress in this area. In addition, there is a need for a comprehensive cancer ontology that would enable semantic representation of textual information found in narrative reports. © 2014 The Authors. Published by Elsevier Ireland Ltd.
Pages: 605-623
Number of pages: 19