Enhancing LSTM-based Word Segmentation Using Unlabeled Data

Cited by: 3
Authors
Zheng, Bo [1]
Che, Wanxiang [1]
Guo, Jiang [1]
Liu, Ting [1]
Affiliations
[1] Harbin Inst Technol, Res Ctr Social Comp & Informat Retrieval, Harbin, Heilongjiang, Peoples R China
Source
CHINESE COMPUTATIONAL LINGUISTICS AND NATURAL LANGUAGE PROCESSING BASED ON NATURALLY ANNOTATED BIG DATA, CCL 2017 | 2017 / Vol. 10565
Funding
National Natural Science Foundation of China;
Keywords
Word segmentation; Statistics-based features; Neural network; Unlabeled data;
DOI
10.1007/978-3-319-69005-6_6
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Word segmentation is commonly treated as a sequence labeling problem. The traditional approach to such problems is a machine learning method, such as conditional random fields, with hand-crafted features. Recently, deep learning approaches have achieved state-of-the-art performance on the word segmentation task, and LSTM networks are a popular choice among them. This paper presents a method for introducing numerical statistics-based features, counted on unlabeled data, into LSTM networks and analyzes how they enhance the performance of the word segmentation model. We add pre-trained character-bigram embeddings, pointwise mutual information, accessor variety, and punctuation variety to our model and compare their performance on different datasets, including three datasets from the CoNLL-2017 shared task and three simplified Chinese datasets. We achieve state-of-the-art performance on two of them and comparable results on the rest.
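To make two of the statistics named in the abstract concrete, the following is a minimal Python sketch of how pointwise mutual information and accessor variety might be counted on unlabeled text. It is not the authors' implementation: the function names `bigram_pmi` and `accessor_variety`, the `max_len` parameter, the base-2 logarithm, and the treatment of sentence boundaries as a single `<s>` or `</s>` accessor are all assumptions made for illustration, and the abstract does not say how the resulting values are fed into the LSTM.

```python
import math
from collections import Counter, defaultdict


def bigram_pmi(corpus):
    """PMI of every adjacent character pair in an unlabeled corpus (hypothetical helper)."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        # PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ); base 2 is an assumption.
        pmi[a + b] = math.log2(p_ab / (p_a * p_b))
    return pmi


def accessor_variety(corpus, max_len=2):
    """min(left AV, right AV) for every substring of up to max_len characters."""
    left = defaultdict(set)
    right = defaultdict(set)
    for sentence in corpus:
        n = len(sentence)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sentence[i:j]
                # Simplification (assumption): a sentence boundary counts as
                # one shared accessor, represented by a marker token.
                left[s].add(sentence[i - 1] if i > 0 else "<s>")
                right[s].add(sentence[j] if j < n else "</s>")
    return {s: min(len(left[s]), len(right[s])) for s in left}


toy_corpus = ["abab", "abc"]  # stand-in for unlabeled sentences
pmi = bigram_pmi(toy_corpus)
av = accessor_variety(toy_corpus)
```

On a real corpus, a character pair with high PMI or a string with high accessor variety is more likely to form a word, which is why such counts are plausible segmentation features.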
Pages: 60-70
Page count: 11