Unsupervised Joint Monolingual Character Alignment and Word Segmentation

Cited by: 0
Authors
Teng, Zhiyang [1, 2]
Xiong, Hao [2, 3]
Liu, Qun [2, 4]
Affiliations
[1] Univ Chinese Acad Sci, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Beijing 100864, Peoples R China
[3] Torangetek Informat Technol Beijing Ltd, Beijing, Peoples R China
[4] Dublin City Univ, Ctr Next Generat Localisat, Fac Engn & Comp, Dublin 9, Ireland
Source
CHINESE COMPUTATIONAL LINGUISTICS AND NATURAL LANGUAGE PROCESSING BASED ON NATURALLY ANNOTATED BIG DATA, CCL 2014 | 2014 / Vol. 8801
Keywords
unsupervised word segmentation; word alignment; Gibbs sampling; Pitman-Yor process
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract
We propose a novel Bayesian model for fully unsupervised word segmentation based on monolingual character alignment. Adapted bilingual word alignment models and a Bayesian language model are combined through a product of experts to estimate the joint posterior distribution of a monolingual character alignment and the corresponding segmentation. Our approach enhances the performance of conventional hierarchical Pitman-Yor language models with richer character-level features. In our experiments, the model achieves an 88.6% word token F-score on the standard Brent version of the Bernstein-Ratner corpus. Moreover, on standard Chinese segmentation datasets, our method outperforms a baseline model by 1.9-2.9 F-score points.
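The word token F-score reported in the abstract is conventionally computed by counting a predicted word as correct only when both of its boundaries match the gold segmentation. The abstract does not specify the evaluation script, so the following is a minimal sketch under that assumption; the function names `word_spans` and `token_f_score` and the toy utterances are hypothetical and are not taken from the paper.

```python
def word_spans(words):
    """Convert one segmented utterance (a list of words) into a set of
    (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def token_f_score(gold, predicted):
    """Word token F-score over paired utterances.

    gold, predicted: lists of segmented utterances; each pair must cover
    the same underlying character string.
    """
    correct = n_gold = n_pred = 0
    for g, p in zip(gold, predicted):
        gs, ps = word_spans(g), word_spans(p)
        correct += len(gs & ps)  # a token counts only if both boundaries match
        n_gold += len(gs)
        n_pred += len(ps)
    if correct == 0:
        return 0.0
    precision = correct / n_pred
    recall = correct / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical data, not from the corpus).
gold = [["you", "want", "to"], ["see", "the", "book"]]
pred = [["you", "want", "to"], ["see", "thebook"]]
score = token_f_score(gold, pred)
```

In the toy example, 4 of the 5 predicted tokens match 4 of the 6 gold tokens, so precision is 0.8, recall is about 0.667, and the F-score is 8/11.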
Pages: 1 / 12
Page count: 12
Related papers
19 in total
[1] Bernstein-Ratner N., 1987, PHONOLOGY PARENT CHI, V6
[2] Brody S., 2010, P EMP METH NAT LANG, P1214
[3] Brown P. F., 1993, Computational Linguistics, V19, P263
[4] Chung Tagyoung, 2009, P 2009 C EMPIRICAL M, P718, DOI 10.3115/1699571.1699606
[5] Goldwater S, 2006, COLING/ACL 2006, VOLS 1 AND 2, PROCEEDINGS OF THE CONFERENCE, P673
[6] A Bayesian framework for word segmentation: Exploring the effects of context [J]. Goldwater, Sharon; Griffiths, Thomas L.; Johnson, Mark. COGNITION, 2009, 112 (01): 21-54
[7] Johnson M., 2009, Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, P317, DOI 10.3115/1620754.1620800
[8] Liu Zhanyi, 2009, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, V2, P487
[9] THE CHILD LANGUAGE DATA EXCHANGE SYSTEM [J]. MACWHINNEY, B; SNOW, C. JOURNAL OF CHILD LANGUAGE, 1985, 12 (02): 271-296
[10] Mochihashi D., 2009, P JOINT C 47 ANN M A, P100, DOI 10.3115/1687878.1687894