Using statistical and contextual information to identify two- and three-character words in Chinese text

Cited by: 11
Authors
Khoo, CSG [1]
Dai, YB
Loh, TE
Affiliations
[1] Nanyang Technol Univ, Sch Commun & Informat, Div Informat Studies, Singapore 637718, Singapore
[2] Data Storage Inst, Singapore, Singapore
Source
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY | 2002 / Vol. 53 / No. 5
DOI
10.1002/asi.10045
Chinese Library Classification number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
New statistical formulas were developed for identifying two- and three-character words in Chinese text. The formulas were constructed by performing stepwise logistic regression using a sample of sentences that had been manually segmented. For identifying two-character words, the relative frequency of the adjacent characters and the document frequency of the overlapping bigrams were found to be significant factors. These provide information about the immediate neighborhood or context of the character string. Contextual information was also found to be significant in predicting three-character words. Local information (the number of times the bigram or trigram occurs in the document being segmented) and the position of the bigram/trigram in the sentence were not found to be useful in identifying words. The new formulas, called contextual information formulas, were found to be substantially better than the mutual information formula in identifying two- and three-character words. Using the contextual information formulas for both two- and three-character words gave significantly better results than using the formula for two-character words alone. The method can also be used for identifying multiword terms in English text.
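As a rough illustration of the two kinds of scoring the abstract compares, the sketch below computes the commonly used pointwise mutual information score for adjacent character pairs, and shows the general shape of a logistic-regression score. It is not the authors' contextual information formula: the function names, the feature names, and all coefficient values are hypothetical placeholders, since the abstract does not give the fitted formula.

```python
import math
from collections import Counter


def bigram_mutual_information(text):
    """Return {bigram: MI} for every adjacent character pair in `text`.

    Uses the usual pointwise form MI(ab) = log2(P(ab) / (P(a) * P(b))),
    with probabilities estimated from `text` itself (an assumption made
    for this sketch; a real system would use a large corpus).
    """
    chars = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = sum(chars.values())
    n_bigrams = sum(bigrams.values())

    scores = {}
    for bigram, count in bigrams.items():
        p_ab = count / n_bigrams
        p_a = chars[bigram[0]] / n_chars
        p_b = chars[bigram[1]] / n_chars
        scores[bigram] = math.log2(p_ab / (p_a * p_b))
    return scores


def logistic_word_probability(features, weights, intercept):
    """Generic logistic model: P(word) = 1 / (1 + exp(-z)).

    `features` and `weights` are dicts keyed by hypothetical feature
    names; the real predictors and coefficients come from the paper's
    stepwise regression and are not reproduced here.
    """
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    sample = "abab"  # toy stand-in for a character sequence
    for bigram, score in sorted(bigram_mutual_information(sample).items()):
        print(bigram, round(score, 3))

    # Placeholder feature values and weights, for illustration only.
    features = {"rel_freq_adjacent": 0.2, "doc_freq_overlap": 0.1}
    weights = {"rel_freq_adjacent": -1.0, "doc_freq_overlap": -2.0}
    print(logistic_word_probability(features, weights, intercept=0.5))
```

A candidate bigram would be accepted as a word when its score exceeds a chosen threshold. The threshold, like the weights above, would have to be tuned on manually segmented text.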
Pages: 365-377
Page count: 13