Word sense disambiguation in the clinical domain: a comparison of knowledge-rich and knowledge-poor unsupervised methods

Cited by: 20
Authors
Chasin, Rachel [1 ]
Rumshisky, Anna [2 ]
Uzuner, Ozlem [3 ]
Szolovits, Peter [1]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
[2] Univ Massachusetts, Dept Comp Sci, Lowell, MA 01854 USA
[3] SUNY Albany, Dept Informat Studies, Albany, NY 12222 USA
DOI
10.1136/amiajnl-2013-002133
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Objective: To evaluate state-of-the-art unsupervised methods on the word sense disambiguation (WSD) task in the clinical domain. In particular, to compare graph-based approaches relying on a clinical knowledge base with bottom-up topic-modeling-based approaches. We investigate several enhancements to the topic-modeling techniques that use domain-specific knowledge sources.
Materials and methods: The graph-based methods use variations of PageRank and distance-based similarity metrics, operating over the Unified Medical Language System (UMLS). Topic-modeling methods use unlabeled data from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC II) database to derive models for each ambiguous word. We investigate the impact of using different linguistic features for topic models, including UMLS-based and syntactic features. We use a sense-tagged clinical dataset from the Mayo Clinic for evaluation.
Results: The topic-modeling methods achieve 66.9% accuracy on a subset of the Mayo Clinic's data, while the graph-based methods only reach the 40-50% range, with a most-frequent-sense baseline of 56.5%. Features derived from the UMLS semantic type and concept hierarchies do not produce a gain over bag-of-words features in the topic models, but identifying phrases from UMLS and using syntax does help.
Discussion: Although topic models outperform graph-based methods, semantic features derived from the UMLS prove too noisy to improve performance beyond bag-of-words.
Conclusions: Topic modeling for WSD provides superior results in the clinical domain; however, integration of knowledge remains to be effectively exploited.
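Note on the baseline: a most-frequent-sense baseline, as the term is usually defined, is the accuracy obtained by always predicting whichever sense occurs most often among the labeled instances of a word. The minimal Python sketch below illustrates that definition only. The function name and the toy labels are hypothetical, the data is not from the Mayo Clinic set, and the abstract does not say how the reported 56.5% was aggregated across words.

```python
from collections import Counter


def most_frequent_sense_accuracy(gold_senses):
    """Accuracy of always predicting the most common gold sense.

    gold_senses: list of sense labels for the instances of ONE ambiguous word.
    """
    if not gold_senses:
        raise ValueError("need at least one labeled instance")
    counts = Counter(gold_senses)
    # most_common(1) returns [(label, count)] for the top label
    _, top_count = counts.most_common(1)[0]
    return top_count / len(gold_senses)


# Hypothetical toy labels for a single ambiguous abbreviation (illustration only)
toy_labels = ["temperature", "temperature", "temperature", "temporal", "temporary"]
mfs_accuracy = most_frequent_sense_accuracy(toy_labels)
print(f"MFS baseline accuracy: {mfs_accuracy:.3f}")
```

An unsupervised method is informative on a word only when it beats this figure, which is why the abstract reports it alongside the accuracies of the two method families.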
Pages: 842-849
Page count: 8