Applying active learning to supervised word sense disambiguation in MEDLINE

Cited by: 16

Authors:

Chen, Yukun ^{[1]}

Cao, Hongxin ^{[2]}

Mei, Qiaozhu ^{[3,4]}

Zheng, Kai ^{[4,5]}

Xu, Hua ^{[1,6]}

Affiliations:

[1] Vanderbilt Univ, Sch Med, Dept Biomed Informat, Nashville, TN 37212 USA

[2] Second Mil Med Univ, Dept Med Informat, Shanghai, Peoples R China

[3] Univ Michigan, Sch Informat, Ann Arbor, MI 48109 USA

[4] Univ Michigan, Dept Elect Engn & Comp Sci, Ann Arbor, MI 48109 USA

[5] Univ Michigan, Dept Hlth Management & Policy, Ann Arbor, MI 48109 USA

[6] Univ Texas Hlth Sci Ctr Houston, Sch Biomed Informat, Houston, TX 77030 USA

Source:

JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION | 2013 / Vol. 20 / Issue 5

Keywords:

Active Learning; Word Sense Disambiguation; Natural Language Processing; Machine Learning; Uncertainty Sampling; Annotation; ABBREVIATIONS; GENE;

DOI:

10.1136/amiajnl-2012-001244

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Objectives: This study assessed whether active learning strategies can be integrated with supervised word sense disambiguation (WSD) methods to reduce the number of annotated samples required while maintaining or improving the quality of the disambiguation models.

Methods: We developed support vector machine (SVM) classifiers to disambiguate 197 ambiguous terms and abbreviations in the MSH WSD collection. Three different uncertainty sampling-based active learning algorithms were implemented with the SVM classifiers and compared with a passive learner (PL) based on random sampling. For each ambiguous term and each learning algorithm, we generated a learning curve that plots test-set accuracy as a function of the number of annotated samples used to train the model. The area under the learning curve (ALC) was used as the primary evaluation metric.

Results: Our experiments demonstrated that active learners (ALs) significantly outperformed the PL, showing better performance for 177 out of 197 (89.8%) WSD tasks. Further analysis showed that to achieve an average accuracy of 90%, the PL needed 38 annotated samples, while the ALs needed only 24, a 37% reduction in annotation effort. Moreover, we analyzed cases where active learning algorithms did not achieve superior performance and identified three causes: (1) poor models in the early learning stage; (2) easy WSD cases; and (3) difficult WSD cases. These findings provide useful insight for future improvements.

Conclusions: This study demonstrated that integrating active learning strategies with supervised WSD methods could effectively reduce annotation cost and improve the disambiguation models.
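The comparison described in the abstract (an uncertainty-sampling learner versus a random-sampling passive learner, each summarized by the area under its learning curve) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic two-class data from the hypothetical `make_blobs` helper stands in for WSD feature vectors, the small Pegasos-style linear SVM stands in for the authors' SVM classifiers, distance to the hyperplane is just one common uncertainty criterion and may not be any of the three algorithms in the study, and the normalized trapezoidal ALC is one plausible definition of that metric.

```python
import random

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def train_linear_svm(X, y, epochs=30, lam=0.01, seed=0):
    # Assumption: Pegasos-style subgradient descent on the hinge loss; labels are -1/+1.
    rng = random.Random(seed)
    w, b, t = [0.0] * len(X[0]), 0.0, 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            violated = y[i] * (dot(w, X[i]) + b) < 1
            w = [(1 - eta * lam) * wj for wj in w]
            if violated:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def accuracy(model, X, y):
    w, b = model
    hits = sum(1 for xi, yi in zip(X, y) if (1 if dot(w, xi) + b >= 0 else -1) == yi)
    return hits / len(X)

def learning_curve(X_pool, y_pool, X_test, y_test, n_queries, active, seed=0):
    rng = random.Random(seed)
    # Start with one labeled example per class so the first model sees both senses.
    labeled = [y_pool.index(1), y_pool.index(-1)]
    unlabeled = [i for i in range(len(X_pool)) if i not in labeled]
    curve = []
    for _ in range(n_queries + 1):
        model = train_linear_svm([X_pool[i] for i in labeled],
                                 [y_pool[i] for i in labeled])
        curve.append(accuracy(model, X_test, y_test))
        if not unlabeled:
            break
        if active:
            # Uncertainty sampling: query the sample closest to the decision boundary.
            w, b = model
            pick = min(unlabeled, key=lambda i: abs(dot(w, X_pool[i]) + b))
        else:
            # Passive learner: query a random sample.
            pick = rng.choice(unlabeled)
        unlabeled.remove(pick)
        labeled.append(pick)
    return curve

def area_under_learning_curve(curve):
    # Assumption: trapezoidal area normalized by the x-axis span, so the result is in [0, 1].
    if len(curve) < 2:
        return curve[0]
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:])) / (len(curve) - 1)

def make_blobs(n, seed):
    # Hypothetical synthetic data: two overlapping 2-D Gaussian classes.
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        X.append([rng.gauss(label, 1.0), rng.gauss(label, 1.0)])
        y.append(label)
    return X, y

X_pool, y_pool = make_blobs(100, seed=1)
X_test, y_test = make_blobs(100, seed=2)
al_curve = learning_curve(X_pool, y_pool, X_test, y_test, n_queries=20, active=True)
pl_curve = learning_curve(X_pool, y_pool, X_test, y_test, n_queries=20, active=False)
print("ALC (active): ", round(area_under_learning_curve(al_curve), 3))
print("ALC (passive):", round(area_under_learning_curve(pl_curve), 3))
```

On toy data like this the two ALC values may be close or even reversed, which is consistent with the abstract's observation that active learning does not help on every task; the sketch only shows how the curves and the metric are assembled.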

Citation

Pages: 1001-1006

Page count: 6
