Unsupervised joint monolingual character alignment and word segmentation

Cited by: 1
Authors
Teng, Zhiyang [1, 2]
Xiong, Hao [2, 3]
Liu, Qun [2, 4]
Affiliations
[1] Institute of Computing Technology, Chinese Academy of Sciences
[2] Centre for Next Generation Localisation, Dublin City University
Source
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | 2014 / Vol. 8801
Funding
National Natural Science Foundation of China
Keywords
Gibbs sampling; Pitman-Yor process; Unsupervised word segmentation; Word alignment
DOI
10.1007/978-3-319-12277-9_1
Abstract
We propose a novel Bayesian model for fully unsupervised word segmentation based on monolingual character alignment. Adapted bilingual word alignment models and a Bayesian language model are combined through a product of experts to estimate the joint posterior distribution of a monolingual character alignment and the corresponding segmentation. Our approach enhances the performance of conventional hierarchical Pitman-Yor language models with richer character-level features. In our experiments, the model achieves an 88.6% word token F-score on the standard Brent version of the Bernstein-Ratner corpus. Moreover, on standard Chinese segmentation datasets, our method outperforms a baseline model by 1.9-2.9 F-score points. © Springer International Publishing Switzerland 2014.
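The abstract reports results as word token F-scores. As an illustration only, the sketch below computes that metric under the common convention that a predicted word counts as correct when both of its boundaries match a gold word. The record does not describe the authors' evaluation script, so the function names (`word_spans`, `token_f_score`) and the toy data are hypothetical.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmented utterance to the set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def token_f_score(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) over word tokens.

    Assumption: a predicted token is correct only if its exact character
    span also appears in the gold segmentation of the same utterance.
    """
    correct = n_gold = n_pred = 0
    for gold_words, pred_words in zip(gold, pred):
        if "".join(gold_words) != "".join(pred_words):
            raise ValueError("gold and predicted utterances differ in characters")
        gold_spans = word_spans(gold_words)
        pred_spans = word_spans(pred_words)
        correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data (hypothetical, not taken from the corpus).
gold = [["you", "want", "to"], ["see", "the", "book"]]
pred = [["you", "wantto"], ["see", "the", "book"]]
precision, recall, f_score = token_f_score(gold, pred)
```

In the toy data, the under-segmented token `wantto` matches no gold span, so it lowers precision and also costs two gold tokens in recall.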
Pages: 1-12
Page count: 11