Stemming and n-grams in Spanish: an evaluation of their impact on information retrieval

Cited by: 8
Authors
Figuerola, CG [1]
Gómez, R [1]
De San Román, EL [1]
Affiliations
[1] Univ Salamanca, Dept Informat & Automat, E-37008 Salamanca, Spain
DOI
10.1177/016555150002600610
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
At some stage, most of the models and techniques implemented in information retrieval use frequency counts of the terms appearing in documents and in queries. However, many words, since they are derived from the same stem, have very close semantic content. This makes a grouping of such variants under a single term advisable. Otherwise, dispersal occurs in the calculation of the frequency of these terms, and it also becomes difficult to compare queries and documents. On the other hand, there are notable differences between languages in the way derivatives and inflected forms are built, so that the application of specific techniques can produce unequal results according to the language of the documents and queries. A description is given of tests carried out on documents in Spanish, which involved some stemming techniques widely used in English, as well as the application of n-grams, and the results are compared.
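The abstract contrasts stemming with n-grams as ways of grouping word variants. As a rough illustration only, the sketch below compares two words by the overlap of their character bigrams using the Dice coefficient. The function names `char_ngrams` and `dice`, the choice of bigrams, and the example words are assumptions made for this illustration; they are not taken from the paper and do not reproduce its experimental setup.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a lowercased word."""
    w = word.lower()
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient between the n-gram sets of two words (0.0 to 1.0)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        # Both words are shorter than n, so there is nothing to compare.
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


# Hypothetical Spanish examples: an inflected pair and an unrelated pair.
print(dice("libro", "libros"))
print(dice("libro", "mesa"))
```

Under this measure an inflected pair such as "libro" / "libros" shares most of its bigrams, while unrelated words share few or none, so variants can be grouped without language-specific suffix rules.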
Pages: 461-467
Page count: 7