Unsupervised joint monolingual character alignment and word segmentation

Cited by: 1

Authors:

Teng, Zhiyang [1,2]

Xiong, Hao [2,3]

Liu, Qun [2,4]

Affiliations:

[1] Institute of Computing Technology, Chinese Academy of Sciences

[2] Centre for Next Generation Localisation, Dublin City University

Source:

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | 2014 / Volume 8801

Funding:

National Natural Science Foundation of China

Keywords:

Gibbs sampling; Pitman-Yor process; Unsupervised word segmentation; Word alignment

DOI:

10.1007/978-3-319-12277-9_1

Abstract:

We propose a novel Bayesian model for fully unsupervised word segmentation based on monolingual character alignment. Adapted bilingual word alignment models and a Bayesian language model are combined through a product of experts to estimate the joint posterior distribution of a monolingual character alignment and the corresponding segmentation. Our approach enhances the performance of conventional hierarchical Pitman-Yor language models with richer character-level features. In our experiments, the model achieves an 88.6% word token F-score on the standard Brent version of the Bernstein-Ratner corpus. Moreover, on standard Chinese segmentation datasets, our method outperforms a baseline model by 1.9-2.9 F-score points. © Springer International Publishing Switzerland 2014.
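The word token F-score reported in the abstract is a standard segmentation metric: a predicted word counts as correct only when both of its boundaries match the gold segmentation. The following is a minimal sketch of that metric, not the authors' evaluation script. The function names `token_spans` and `token_f_score` and the example utterance are illustrative assumptions.

```python
def token_spans(words):
    """Map a segmented utterance to the set of (start, end) character spans of its words."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def token_f_score(gold, predicted):
    """Word token F-score over parallel lists of segmented utterances.

    Assumes each gold/predicted pair covers the same character string.
    """
    correct = n_gold = n_pred = 0
    for gold_words, pred_words in zip(gold, predicted):
        gold_spans = token_spans(gold_words)
        pred_spans = token_spans(pred_words)
        correct += len(gold_spans & pred_spans)  # both boundaries must match
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction merges "the" and "book" into one token.
gold = [["you", "want", "to", "see", "the", "book"]]
predicted = [["you", "want", "to", "see", "thebook"]]
score = token_f_score(gold, predicted)
```

In this example four of the five predicted tokens and four of the six gold tokens match, so precision is 4/5, recall is 4/6, and the F-score is 8/11.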

References

Pages: 1-12

Page count: 11

Total: 19 entries

[11]

Bernstein-Ratner N., The phonology of parent-child speech, (1987)

[12]

Xu J., Gao J., Toutanova K., Ney H., Bayesian semi-supervised Chinese word segmentation for statistical machine translation, Proceedings of COLING, COLING 2008, pp. 1017-1024, (2008)

[13]

Nguyen T., Vogel S., Smith N.A., Nonparametric word segmentation for machine translation, Proceedings of COLING, COLING 2010, pp. 815-823, (2010)

[14]

Chung T., Gildea D., Unsupervised tokenization for machine translation, Proceedings of EMNLP, EMNLP 2009, pp. 718-726, (2009)

[15]

Pitman J., Yor M., The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator, (1995)

[16]

Goldwater S., Griffiths T.L., Johnson M., A Bayesian framework for word segmentation: Exploring the effects of context, Cognition, 112, pp. 21-54, (2009)

[17]

Och F.J., Ney H., A systematic comparison of various statistical alignment models, Computational Linguistics, (2003)

[18]

Emerson T., The second international Chinese word segmentation bakeoff, (2005)

[19]

MacWhinney B., Snow C., et al., The child language data exchange system, Journal of Child Language, 12, pp. 271-296, (1985)
