A cognitive inspired unsupervised language-independent text stemmer for Information retrieval

Cited by: 16
Authors
Alotaibi, Fahd Saleh [1]
Gupta, Vishal [2]
Affiliations
[1] King Abdulaziz Univ, Fac Comp & Informat Technol, Jeddah, Saudi Arabia
[2] Panjab Univ Chandigarh, Univ Inst Engn & Technol, Dept Comp Sci & Engn, Chandigarh, India
Keywords
Morphology; Stemming; Stemmer; Language-independent stemming; Information Retrieval; Corpus-Based Stemming; ALGORITHM
DOI
10.1016/j.cogsys.2018.07.003
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In information retrieval systems, stemming conflates words that occur in different morphological forms, so that document and query terms with related meanings can be matched. In this article, we propose a cognitively inspired, language-independent stemmer that learns groups of morphologically related words from the ambient corpus without any linguistic knowledge or human intervention, in a manner modeled on how the human brain works. The main idea of the proposed algorithm is to select only those word variants from the ambient corpus that match the original intent of the query terms. We conducted ad hoc retrieval experiments in several languages of varying morphological complexity using standard TREC, FIRE, and CLEF document collections. The results indicate that stemming improves retrieval accuracy and that the effectiveness of a stemming algorithm increases with the morphological complexity of the language. The results also indicate that the proposed algorithm outperforms stemmers based on linguistic knowledge and other state-of-the-art statistical stemmers in almost all the languages under study. These results are encouraging for multilingual settings. (C) 2018 Elsevier B.V. All rights reserved.
Pages: 291-300
Number of pages: 10