A MAPREDUCE BASED DISTRIBUTED LSI FOR SCALABLE INFORMATION RETRIEVAL

Citations: 0
Authors
Liu, Yang [1]
Li, Maozhen [2, 3]
Khan, Mukhtaj [2]
Qi, Man [4]
Affiliations
[1] Sichuan Univ, Sch Elect Engn & Informat, Chengdu, Peoples R China
[2] Brunel Univ, Sch Engn & Design, Uxbridge UB8 3PH, Middx, England
[3] Tongji Univ, Key Lab Embedded Syst & Serv Comp, Shanghai, Peoples R China
[4] Canterbury Christ Church Univ, Dept Comp, Canterbury CT1 1QU, Kent, England
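Keywords
Information retrieval; latent semantic indexing; MapReduce; load balancing; genetic algorithms
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract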
Latent Semantic Indexing (LSI) has been widely used in information retrieval because it handles the problems of polysemy and synonymy effectively. However, LSI is computationally intensive because of the complexity of the singular value decomposition and filtering operations it involves. This paper presents MR-LSI, a MapReduce-based distributed LSI algorithm for scalable information retrieval. The performance of MR-LSI is first evaluated in a small-scale experimental cluster environment and subsequently in large-scale simulation environments. By partitioning the dataset into smaller subsets and optimizing the partitioned subsets across a cluster of computing nodes, the overhead of the MR-LSI algorithm is reduced significantly while maintaining a high level of accuracy in retrieving documents of user interest. A genetic-algorithm-based load balancing scheme is designed to optimize the performance of MR-LSI in heterogeneous computing environments in which the computing nodes have varied resources.
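The abstract does not give the algorithm itself, so the following is only a minimal single-process sketch of the general idea it describes: split the document set into subsets, run LSI (truncated SVD) on each subset as a "map" step, and merge the per-subset rankings in a "reduce" step. All function names (`lsi_fit`, `map_partition`, `reduce_scores`, `mr_lsi_search`), the even split by document index, and the cosine scoring are assumptions made for illustration. They are not taken from the paper, and the sketch omits the paper's subset optimization and genetic-algorithm load balancing.

```python
import numpy as np


def lsi_fit(term_doc, k):
    """Rank-k truncated SVD of a (terms x documents) matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def map_partition(term_doc, doc_ids, query, k):
    """Map step (assumed): LSI on one subset, then score its documents."""
    k = min(k, min(term_doc.shape))
    U, s, Vt = lsi_fit(term_doc, k)
    # Drop numerically zero singular values before dividing by them.
    keep = s > 1e-12
    U, s, Vt = U[:, keep], s[keep], Vt[keep, :]
    # Fold the query into the latent space: q_hat = S^-1 U^T q.
    q_hat = (U.T @ query) / s
    docs = Vt.T  # row j is document j in the latent space
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat)
    # Cosine similarity; documents or queries with ~zero norm score 0.
    sims = np.divide(docs @ q_hat, denom,
                     out=np.zeros(len(doc_ids)), where=denom > 1e-12)
    return [(int(d), float(v)) for d, v in zip(doc_ids, sims)]


def reduce_scores(partial_results, top_n):
    """Reduce step (assumed): merge per-subset scores, keep the best top_n."""
    merged = [pair for part in partial_results for pair in part]
    return sorted(merged, key=lambda p: p[1], reverse=True)[:top_n]


def mr_lsi_search(term_doc, query, n_parts, k, top_n):
    """Sequential stand-in for a MapReduce job over document subsets."""
    doc_ids = np.arange(term_doc.shape[1])
    parts = [ids for ids in np.array_split(doc_ids, n_parts) if len(ids)]
    partial = [map_partition(term_doc[:, ids], ids, query, k) for ids in parts]
    return reduce_scores(partial, top_n)
```

In a real MapReduce deployment each `map_partition` call would run on a separate node, which is where the per-subset SVD cost is spread out. In a heterogeneous cluster, equal-sized subsets would leave faster nodes idle, which is the situation the abstract's load balancing scheme is meant to address.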
Pages: 259-280
Page count: 22