Applying machine learning to text segmentation for information retrieval

Cited by: 43
Authors
Huang, XJ
Peng, FC
Schuurmans, D
Cercone, N
Robertson, SE
Affiliations
[1] Univ Waterloo, Sch Comp Sci, Waterloo, ON N2L 3G1, Canada
[2] City Univ London, London EC1V 0HB, England
[3] Microsoft Res Ltd, Cambridge, England
Source
INFORMATION RETRIEVAL | 2003 / Vol. 6 / Issue 3-4
Keywords
machine learning; word segmentation; EM algorithm; Chinese information retrieval
DOI
10.1023/A:1026028229881
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We propose a self-supervised word segmentation technique for text segmentation in Chinese information retrieval. This method combines the advantages of traditional dictionary-based, character-based and mutual-information-based approaches, while overcoming many of their shortcomings. Experiments on TREC data show this method is promising. Our method is completely language independent and unsupervised, which provides a promising avenue for constructing accurate multi-lingual or cross-lingual information retrieval systems that are flexible and adaptive. We find that although the segmentation accuracy of self-supervised segmentation is not as high as that of some other segmentation methods, it is enough to give good retrieval performance. It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. However, for Chinese, we find that the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur, which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%, including 70% word segmentation accuracy from our self-supervised word segmentation approach. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserved by accurate segmenters, while they are broken up by less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.
Our research suggests that machine learning techniques can play an important role in building adaptable information retrieval systems, and that different evaluation standards for word segmentation should be applied to different applications.
Pages: 333-362
Page count: 30