Evaluation of N-Gram Conflation Approaches for Arabic Text Retrieval

Cited by: 18
Authors
Ahmed, Farag [1 ]
Nuernberger, Andreas [1 ]
Affiliation
[1] Otto von Guericke Univ Magdeburg, Data & Knowledge Engn Grp, Dept Tech & Operat Informat Syst, Fac Comp Sci, Magdeburg, Germany
Source
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY | 2009 / Vol. 60 / Issue 07
Keywords
NATURAL-LANGUAGE; SEARCH
DOI
10.1002/asi.21063
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In this paper we present a language-independent approach for conflation that does not depend on predefined rules or prior knowledge of the target language. The proposed unsupervised method is based on an enhancement of the pure n-gram model that can group related words based on various string-similarity measures, while restricting the search to specific locations of the target word by taking into account the order of n-grams. We show that the method is effective in achieving high similarity scores for all word-form variations and that it reduces ambiguity, i.e., it obtains higher precision and recall than pure n-gram-based approaches for English, Portuguese, and Arabic. The proposed method is especially suited to conflation in Arabic, since Arabic is a highly inflectional language. We therefore also present an adaptive user interface for Arabic text retrieval called "araSearch". araSearch serves as a metasearch interface to existing search engines. The system is able to extend a query using the proposed conflation approach so that additional results for relevant subwords can be found automatically.
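The contrast the abstract draws between a pure n-gram model and one that takes n-gram order into account can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of bigrams, the Dice coefficient as the string-similarity measure, the `window` parameter, and the function names `positional_ngrams`, `dice`, and `positional_dice` are all assumptions made here for illustration only.

```python
from collections import Counter


def positional_ngrams(word, n=2):
    """Return (position, n-gram) pairs for a word (empty if the word is shorter than n)."""
    return [(i, word[i:i + n]) for i in range(len(word) - n + 1)]


def dice(word_a, word_b, n=2):
    """Pure n-gram Dice coefficient: n-gram positions are ignored."""
    a = Counter(g for _, g in positional_ngrams(word_a, n))
    b = Counter(g for _, g in positional_ngrams(word_b, n))
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum((a & b).values())  # multiset intersection
    return 2 * shared / total


def positional_dice(word_a, word_b, n=2, window=1):
    """Dice coefficient in which two equal n-grams only count as shared
    when their positions differ by at most `window` (an assumed parameter)."""
    grams_a = positional_ngrams(word_a, n)
    grams_b = positional_ngrams(word_b, n)
    total = len(grams_a) + len(grams_b)
    if total == 0:
        return 0.0
    used = set()  # positions in word_b that are already matched
    shared = 0
    for i, g in grams_a:
        for j, h in grams_b:
            if j not in used and g == h and abs(i - j) <= window:
                used.add(j)
                shared += 1
                break
    return 2 * shared / total
```

Under these assumptions, `dice("abcd", "cdab")` still scores the two strings as fairly similar because they share the bigrams "ab" and "cd", whereas `positional_dice("abcd", "cdab")` returns 0.0 because those bigrams occur at different positions. For a simple inflection such as "walk" and "walks", both functions return the same score.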
Pages: 1448-1465
Page count: 18