Unsupervised joint monolingual character alignment and word segmentation

Cited by: 1
Authors
Teng, Zhiyang [1, 2]
Xiong, Hao [2, 3]
Liu, Qun [2, 4]
Affiliations
[1] Institute of Computing Technology, Chinese Academy of Sciences
[2] Centre for Next Generation Localisation, Dublin City University
Source
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | 2014 / Vol. 8801
Funding
National Natural Science Foundation of China
Keywords
Gibbs sampling; Pitman-Yor process; Unsupervised word segmentation; Word alignment
DOI
10.1007/978-3-319-12277-9_1
Abstract
We propose a novel Bayesian model for fully unsupervised word segmentation based on monolingual character alignment. Adapted bilingual word alignment models and a Bayesian language model are combined through a product of experts to estimate the joint posterior distribution of a monolingual character alignment and the corresponding segmentation. Our approach enhances the performance of conventional hierarchical Pitman-Yor language models with richer character-level features. In our experiments, the model achieves an 88.6% word token F-score on the standard Brent version of the Bernstein-Ratner corpus. Moreover, on standard Chinese segmentation datasets, our method outperforms a baseline model by 1.9-2.9 F-score points. © Springer International Publishing Switzerland 2014.
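The abstract reports results as a word token F-score. The following is a minimal sketch of how that metric is commonly computed for segmentation output, assuming a predicted word counts as correct only when both of its boundaries match a gold word. The function names `spans` and `token_f_score` are hypothetical, and this is not the authors' evaluation script.

```python
def spans(words):
    """Convert one segmented utterance into a set of (start, end) character spans."""
    out, start = set(), 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def token_f_score(gold, predicted):
    """Return (precision, recall, F) over word tokens.

    gold, predicted: parallel lists of utterances, each a list of word strings
    covering the same underlying character sequence (assumption).
    """
    correct = n_gold = n_pred = 0
    for g, p in zip(gold, predicted):
        g_spans, p_spans = spans(g), spans(p)
        correct += len(g_spans & p_spans)  # tokens with both boundaries right
        n_gold += len(g_spans)
        n_pred += len(p_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Toy data, invented for illustration only.
gold = [["you", "want", "to"]]
pred = [["you", "wantto"]]
p, r, f = token_f_score(gold, pred)
```

In the toy case only "you" has both boundaries right, so precision is 1/2 and recall is 1/3.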
Pages: 1-12
Page count: 11