Usability-driven pruning of large ontologies: the case of SNOMED CT

Cited by: 11
Authors
Lopez-Garcia, Pablo [1]
Boeker, Martin [2]
Illarramendi, Arantza [1]
Schulz, Stefan [2,3]
Affiliations
[1] Univ Basque Country, Dept Lenguajes & Sistemas Informat, Donostia San Sebastian 20008, Spain
[2] Univ Freiburg, Inst Med Biometrie & Med Informat, D-79106 Freiburg, Germany
[3] Med Univ Graz, Inst Med Informat Stat & Dokument, Graz, Austria
DOI
10.1136/amiajnl-2011-000503
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Objectives: To study ontology modularization techniques when applied to SNOMED CT in a scenario in which no previous corpus of information exists and to examine if frequency-based filtering using MEDLINE can reduce subset size without discarding relevant concepts.
Materials and Methods: Subsets were first extracted using four graph-traversal heuristics and one logic-based technique, and were subsequently filtered with frequency information from MEDLINE. Twenty manually coded discharge summaries from cardiology patients were used as signatures and test sets. The coverage, size, and precision of extracted subsets were measured.
Results: Graph-traversal heuristics provided high coverage (71-96% of terms in the test sets of discharge summaries) at the expense of subset size (17-51% of the size of SNOMED CT). Pre-computed subsets and logic-based techniques extracted small subsets (1%), but coverage was limited (24-55%). Filtering reduced the size of large subsets to 10% while still providing 80% coverage.
Discussion: Extracting subsets to annotate discharge summaries is challenging when no previous corpus exists. Ontology modularization provides valuable techniques, but the resulting modules grow as signatures spread across subhierarchies, yielding a very low precision.
Conclusion: Graph-traversal strategies and frequency data from an authoritative source can prune large biomedical ontologies and produce useful subsets that still exhibit acceptable coverage. However, a clinical corpus closer to the specific use case is preferred when available.
Pages: E102-E109
Page count: 8
Related papers
24 in total
[1] [Anonymous], 2003, DESCRIPTION LOGIC HD
[2] [Anonymous], SNOMED CT TECHN IMPL
[3] Baader F, 2005, 19TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI-05), P364
[4] Benson T, 2010, HEALTH INFORM SER, P1, DOI 10.1007/978-1-84882-803-2
[5] D'Aquin M, 2007, 18 INT C INF KNOWL M, P874
[6] d'Aquin M, 2009, LECT NOTES COMPUT SC, V5445, P67
[7] Doran P, 2007, 16 ACM C INF KNOWL M, P13
[8] The UMLS-CORE project: a study of the problem list terminologies used in large healthcare institutions [J]. Fung, Kin Wah; McDonald, Clement; Srinivasan, Suresh. JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2010, 17 (06): 675-680
[9] Modular reuse of ontologies: Theory and practice [J]. Grau, Bernardo Cuenca; Horrocks, Ian; Kazakov, Yevgeny; Sattler, Ulrike. JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2008, 31: 273-318
[10] Grau BC, 2007, LECT NOTES COMPUT SC, V4825, P183