Evaluation of N-Gram Conflation Approaches for Arabic Text Retrieval

Cited by: 18
Authors
Ahmed, Farag [1 ]
Nuernberger, Andreas [1 ]
Affiliation
[1] Otto von Guericke Univ Magdeburg, Data & Knowledge Engn Grp, Dept Tech & Operat Informat Syst, Fac Comp Sci, Magdeburg, Germany
Source
Journal of the American Society for Information Science and Technology | 2009 / Vol. 60 / No. 7
Keywords
NATURAL-LANGUAGE; SEARCH;
DOI
10.1002/asi.21063
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In this paper, we present a language-independent approach to conflation that does not depend on predefined rules or prior knowledge of the target language. The proposed unsupervised method enhances the pure n-gram model: it groups related words using various string-similarity measures while restricting the search to specific locations of the target word by taking the order of the n-grams into account. We show that the method achieves high similarity scores for all word-form variations and reduces ambiguity, i.e., it obtains higher precision and recall than pure n-gram-based approaches for English, Portuguese, and Arabic. Because Arabic is a highly inflectional language, the method is especially suited to conflation in Arabic. We therefore also present an adaptive user interface for Arabic text retrieval called "araSearch", which serves as a metasearch interface to existing search engines. The system extends a query using the proposed conflation approach so that additional results for relevant subwords can be found automatically.
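The abstract does not spell out the exact formulation, so the following is only a rough sketch of the general idea, not the authors' method. It contrasts a pure n-gram Dice similarity with a variant in which an n-gram counts as a match only when it occurs at a nearby position in the other word. The function names, the choice of the Dice coefficient, and the `window` parameter are all assumptions made for illustration.

```python
def ngrams(word, n=2):
    """Return the character n-grams of `word` in their original order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def dice(word_a, word_b, n=2):
    """Pure n-gram Dice similarity: n-gram positions are ignored."""
    a, b = set(ngrams(word_a, n)), set(ngrams(word_b, n))
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def positional_dice(word_a, word_b, n=2, window=1):
    """Dice-style similarity that respects n-gram order (illustrative only).

    An n-gram at index i in word_a matches an equal n-gram in word_b only
    if that n-gram sits at an index within `window` of i. `window` is a
    hypothetical parameter, not taken from the paper.
    """
    grams_a, grams_b = ngrams(word_a, n), ngrams(word_b, n)
    if not grams_a or not grams_b:
        return 0.0
    used = set()  # indices in grams_b that are already matched
    matches = 0
    for i, gram in enumerate(grams_a):
        lo = max(0, i - window)
        hi = min(len(grams_b), i + window + 1)
        for j in range(lo, hi):
            if j not in used and grams_b[j] == gram:
                used.add(j)
                matches += 1
                break
    return 2 * matches / (len(grams_a) + len(grams_b))


if __name__ == "__main__":
    # "abcd" and "cdab" share bigrams but in swapped positions.
    print(dice("abcd", "cdab"))
    print(positional_dice("abcd", "cdab"))
    print(positional_dice("work", "works"))
```

In this sketch, the two words "abcd" and "cdab" share the bigrams "ab" and "cd", so the pure measure rates them as fairly similar, while the position-aware variant finds no match because the shared bigrams sit at opposite ends of the words. That is one way to read the abstract's claim that taking n-gram order into account reduces ambiguity.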
Pages: 1448-1465
Page count: 18