Character contiguity in N-gram-based word matching: the case for Arabic text searching

Cited by: 15
Authors
Mustafa, SH [1]
Affiliations
[1] Yarmouk Univ, Fac Informat Technol, Dept Comp Informat Syst, Irbid, Jordan
Keywords
N-grams; string matching; text searching; stemming; word conflation
DOI
10.1016/j.ipm.2004.02.003
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This work assesses the performance of two N-gram matching techniques for Arabic root-driven string searching: contiguous N-grams and hybrid N-grams, combining contiguous and non-contiguous. The two techniques were tested using three experiments involving different levels of textual word stemming, a textual corpus containing about 25 thousand words (with a total size of about 160KB), and a set of 100 query textual words. The results of the hybrid approach showed significant performance improvement over the conventional contiguous approach, especially in the cases where stemming was used. The present results and the inconsistent findings of previous studies raise some questions regarding the efficiency of pure conventional N-gram matching and the ways in which it should be used in languages other than English. (c) 2004 Elsevier Ltd. All rights reserved.
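The difference between the two techniques named in the abstract can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not state: N is fixed at 2, a non-contiguous N-gram is taken to be a pair of characters separated by exactly one skipped character, and words are compared with the Dice coefficient over sets of unique N-grams. The function names and the Latin transliterations `ktb` (a root) and `katb` (a derived word with an infix) are hypothetical and chosen only for illustration; this is not the paper's implementation.

```python
def contiguous_bigrams(word):
    """Set of adjacent character pairs in the word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def skip_bigrams(word):
    """Set of character pairs that skip one character (assumed definition)."""
    return {word[i] + word[i + 2] for i in range(len(word) - 2)}


def hybrid_bigrams(word):
    """Union of contiguous and non-contiguous bigrams."""
    return contiguous_bigrams(word) | skip_bigrams(word)


def dice(a, b):
    """Dice coefficient between two N-gram sets (assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical transliterated example: a root and a word with an infixed vowel.
root, word = "ktb", "katb"

contiguous_score = dice(contiguous_bigrams(root), contiguous_bigrams(word))
hybrid_score = dice(hybrid_bigrams(root), hybrid_bigrams(word))

print(contiguous_score)  # 0.4
print(hybrid_score)      # 0.5
```

In this toy case the infix breaks the contiguous pair `kt`, so only `tb` is shared under contiguous matching, while the skip pair recovers `kt` under the hybrid scheme. That is one plausible reading of why a hybrid approach might help with infix-heavy morphology; the abstract itself reports only the measured improvement.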
Pages: 819-827
Number of pages: 9