Query based Chinese phrase extraction for site search

Cited by: 0
Authors
Xu, JF [1 ]
Ye, SZ [1 ]
Li, X [1 ]
Affiliations
[1] Tsing Hua Univ, Dept Elect Engn, Beijing 100084, Peoples R China
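Source
WEB INFORMATION SYSTEMS - WISE 2004, PROCEEDINGS | 2004 / Vol. 3306
Keywords
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract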
Word segmentation (WS) is one of the major issues of information processing in character-based languages, for there are no explicit word boundaries in these languages. Moreover, a combination of multiple continuous words, a phrase, is usually a minimum meaningful unit. Although much work has been done on WS, in site web search little has been explored to mine site-specific knowledge from the user query log for both more accurate WS and better retrieval performance. This paper proposes a novel, statistics-based method to extract phrases based on the user query log. The extracted phrases, combined with a general, static dictionary, construct a dynamic, site-specific dictionary. According to the dictionary, web documents are segmented into phrases and words, which are kept as separate index terms to build a phrase-enhanced index for site search. The experimental results show that our approach greatly improves the retrieval performance. It also helps to detect many out-of-vocabulary words, such as site-specific phrases, newly created words and names of people and locations, which are difficult to process with a general, static dictionary.
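The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which statistic the paper uses, so everything here is an assumption made for illustration: phrases are adjacent word pairs that recur in the query log at least `min_count` times, segmentation is plain forward maximum matching, and the names `segment`, `extract_phrases`, and `index_terms` as well as the toy dictionary and queries are hypothetical.

```python
from collections import Counter


def segment(text, dictionary):
    """Forward maximum matching: at each position take the longest
    dictionary entry, falling back to a single character."""
    max_len = max(map(len, dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def extract_phrases(query_log, base_dict, min_count=2):
    """Assumed statistic: adjacent word pairs that occur in at least
    min_count queries-worth of pairs become phrases."""
    pair_counts = Counter()
    for query in query_log:
        words = segment(query, base_dict)
        for left, right in zip(words, words[1:]):
            pair_counts[left + right] += 1
    return {phrase for phrase, n in pair_counts.items() if n >= min_count}


def index_terms(document, base_dict, phrases):
    """Segment with the merged (site-specific) dictionary and keep each
    phrase and its constituent words as separate index terms."""
    site_dict = base_dict | phrases
    terms = []
    for token in segment(document, site_dict):
        terms.append(token)
        if token in phrases:
            terms.extend(segment(token, base_dict))
    return terms


# Toy data, invented for this example only.
base_dict = {"清华", "大学", "招生", "简章"}
query_log = ["清华大学招生", "清华大学", "清华大学简章"]

phrases = extract_phrases(query_log, base_dict)
terms = index_terms("清华大学招生简章", base_dict, phrases)
print(phrases)
print(terms)
```

With this toy log, only the pair that recurs across queries is promoted to a phrase, and the document is indexed under both the phrase and its component words, so a query for either form can match. A real system would likely replace the raw count threshold with a stronger association measure.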
Pages: 125-134
Number of pages: 10