An auto-indexing method for Arabic text

Cited by: 17
Authors
Mansour, Nashat [1]
Haraty, Ramzi A. [1]
Daher, Walid [1]
Houri, Manal [1]
Affiliations
[1] Lebanese Amer Univ, Div Comp Sci & Math, Beirut 11023801, Lebanon
Keywords
Arabic text; document auto-indexing; information retrieval; stem words; word spread
DOI
10.1016/j.ipm.2007.12.007
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This work addresses the information retrieval problem of auto-indexing Arabic documents. Auto-indexing a text document refers to automatically extracting words that are suitable for building an index for the document. In this paper, we propose an auto-indexing method for Arabic text documents. The method is based mainly on morphological analysis and on a technique for assigning weights to words. The morphological analysis uses a number of grammatical rules to extract stem words, which become candidate index words. The weight assignment technique computes a weight for each of these words relative to the document that contains it. The weight depends on how widely the word is spread across the document, not only on its rate of occurrence. The candidate index words are then sorted in descending order of weight so that information retrievers can select the more important index words. We empirically verify the usefulness of our method using several examples. For these examples, we obtained an average recall of 46% and an average precision of 64%. (C) 2008 Elsevier Ltd. All rights reserved.
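The abstract does not give the weight formula, so the following Python sketch is only an illustration of the general idea of combining a stem's rate of occurrence with its spread. It is not the authors' method. Everything in it is an assumption: the function name rank_index_candidates, the num_segments parameter, the choice to measure spread as the fraction of equal-sized document segments in which a stem appears, and the choice to multiply the two factors. The input is assumed to be a list of stems already produced by some morphological analyzer, and the transliterated stems in the usage lines are placeholders.

```python
from collections import defaultdict


def rank_index_candidates(stems, num_segments=4):
    """Rank stems by an assumed weight: relative frequency times spread.

    stems: stems in document order (assumed output of a separate
           morphological analysis step, which is not shown here).
    num_segments: hypothetical number of equal-sized segments used
                  to measure how widely a stem is spread.
    Returns (stem, weight) pairs sorted by descending weight.
    """
    if not stems:
        return []
    total = len(stems)
    num_segments = max(1, min(num_segments, total))

    counts = defaultdict(int)          # occurrences of each stem
    segments_seen = defaultdict(set)   # segments in which each stem occurs
    for position, stem in enumerate(stems):
        counts[stem] += 1
        segments_seen[stem].add(position * num_segments // total)

    weights = {}
    for stem, count in counts.items():
        frequency = count / total                         # rate of occurrence
        spread = len(segments_seen[stem]) / num_segments  # coverage of segments
        weights[stem] = frequency * spread                # assumed combination

    # Descending weight; ties are broken alphabetically for determinism.
    return sorted(weights.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # "ktb" and "drs" both occur twice, but "ktb" is clustered at the start
    # while "drs" appears in two different segments, so "drs" ranks higher.
    document = ["ktb", "ktb", "drs", "elm", "elm", "drs", "elm", "elm"]
    for stem, weight in rank_index_candidates(document):
        print(stem, weight)
```

In this sketch two stems with the same number of occurrences receive different weights when one is clustered in a single part of the document and the other recurs throughout it, which is the behaviour the abstract attributes to a spread-based weight.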
Pages: 1538-1545
Number of pages: 8