A comprehensive study of named entity recognition in Chinese clinical text

Cited by: 140
Authors
Lei, Jianbo [1, 2]
Tang, Buzhou [2, 3]
Lu, Xueqin [1]
Gao, Kaihua [1]
Jiang, Min [2]
Xu, Hua [2]
Affiliations
[1] Peking Univ, Ctr Med Informat, Beijing 100871, Peoples R China
[2] Univ Texas Sch Biomed Informat Houston, Houston, TX USA
[3] Shenzhen Grad Sch, Harbin Inst Technol, Dept Comp Sci, Shenzhen, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
INFORMATION; ASSERTIONS;
DOI
10.1136/amiajnl-2013-002381
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Objective: Named entity recognition (NER) is one of the fundamental tasks in natural language processing. In the medical domain, there have been a number of studies on NER in English clinical notes; however, very limited NER research has been carried out on clinical notes written in Chinese. The goal of this study was to systematically investigate features and machine learning algorithms for NER in Chinese clinical text.

Materials and methods: We randomly selected 400 admission notes and 400 discharge summaries from Peking Union Medical College Hospital in China. For each note, four types of entity (clinical problems, procedures, laboratory tests, and medications) were annotated according to a predefined guideline. Two-thirds of the 400 notes were used to train the NER systems and one-third for testing. We investigated the effects of different types of features, including bag-of-characters, word segmentation, part-of-speech, and section information, and of different machine learning algorithms, including conditional random fields (CRF), support vector machines (SVM), maximum entropy (ME), and structural SVM (SSVM), on the Chinese clinical NER task. All classifiers were trained on the training dataset and evaluated on the test set, and micro-averaged precision, recall, and F-measure were reported.

Results: Our evaluation on the independent test set showed that most types of features were beneficial to Chinese NER systems, although the improvements were limited. The system achieved the highest performance by combining word segmentation and section information, indicating that these two types of features complement each other. When the same types of optimized features were used, CRF and SSVM outperformed SVM and ME. More specifically, SSVM achieved the highest performance of the four algorithms, with F-measures of 93.51% and 90.01% for admission notes and discharge summaries, respectively.
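The abstract reports micro-averaged precision, recall, and F-measure. As a minimal sketch of what micro-averaging means (not the authors' evaluation code), the Python below pools true positives, false positives, and false negatives across all entity types before computing the three scores. The function name `micro_prf`, the span representation `(note_id, start, end)`, and the exact-match criterion are assumptions made for illustration; the abstract does not state the matching criterion used.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F-measure.

    gold, pred: dicts mapping an entity type (e.g. "problem") to a set of
    (note_id, start, end) spans. Exact span match is assumed here.
    """
    tp = fp = fn = 0
    # Pool counts over every entity type before dividing (micro-averaging).
    for etype in set(gold) | set(pred):
        g = gold.get(etype, set())
        p = pred.get(etype, set())
        tp += len(g & p)   # predicted spans that match a gold span
        fp += len(p - g)   # predicted spans with no gold match
        fn += len(g - p)   # gold spans that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical toy annotations, for illustration only.
gold = {"problem": {(1, 0, 2), (1, 5, 8)}, "medication": {(1, 10, 12)}}
pred = {"problem": {(1, 0, 2)}, "medication": {(1, 10, 12), (1, 20, 22)}}
precision, recall, f_measure = micro_prf(gold, pred)
```

Because counts are pooled, frequent entity types weigh more heavily in the final scores than rare ones, which is the usual reason to distinguish micro- from macro-averaging.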
Pages: 808-814
Number of pages: 7