Content-based Document Routing and Index Partitioning for Scalable Similarity-based Searches in a Large Corpus

Cited by: 0
Authors
Bhagwat, Deepavali [1]
Eshghi, Kave [1]
Mehra, Pankaj [1]
Affiliation
[1] Univ Calif Santa Cruz, Storage Syst Res Ctr, Santa Cruz, CA 95064 USA
Source
KDD-2007 PROCEEDINGS OF THE THIRTEENTH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING | 2007
Keywords
similarity-based search; scalability; index partitioning; distributed indexing; document routing;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present a document routing and index partitioning scheme for scalable similarity-based search of documents in a large corpus. We consider the case when similarity-based search is performed by finding documents that have features in common with the query document. While it is possible to store all the features of all the documents in one index, this suffers from obvious scalability problems. Our approach is to partition the feature index into multiple smaller partitions that can be hosted on separate servers, enabling scalable and parallel search execution. When a document is ingested into the repository, a small number of partitions are chosen to store the features of the document. Likewise, a similarity-based search queries only a small number of partitions. Our approach is stateless and incremental. The decision as to which partitions the features of the document should be routed to (for storing at ingestion time, and for similarity-based search at query time) is based solely on the features of the document. Our approach scales very well. We show that executing similarity-based searches over such a partitioned search space has minimal impact on the precision and recall of search results, even though every search consults less than 3% of the total number of partitions.
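The routing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how features are extracted or how partitions are selected, so the word-shingle features, the SHA-1 hashing, the "smallest hashes modulo partition count" rule, and every name below (`features`, `route`, `PartitionedIndex`, `fanout`) are assumptions made for illustration only. The one property taken from the abstract is that the routing decision is stateless and depends only on the document's own features, at both ingestion and query time.

```python
import hashlib
from collections import defaultdict


def features(text, width=3):
    """Assumed feature extractor: the set of overlapping word shingles."""
    words = text.split()
    if len(words) < width:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + width]) for i in range(len(words) - width + 1)}


def feature_hash(feature):
    """Map a feature to a 64-bit integer using the first 8 bytes of SHA-1."""
    digest = hashlib.sha1(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def route(feats, num_partitions, fanout=2):
    """Assumed routing rule: take the `fanout` smallest feature hashes and
    map each to a partition. Only the document's features are consulted."""
    smallest = sorted(feature_hash(f) for f in feats)[:fanout]
    return {h % num_partitions for h in smallest}


class PartitionedIndex:
    """Hypothetical in-memory stand-in for partitions hosted on servers."""

    def __init__(self, num_partitions=64, fanout=2):
        self.num_partitions = num_partitions
        self.fanout = fanout
        # One feature -> document-id map per partition.
        self.partitions = [defaultdict(set) for _ in range(num_partitions)]

    def ingest(self, doc_id, text):
        feats = features(text)
        # Store all features of the document in each chosen partition.
        for p in route(feats, self.num_partitions, self.fanout):
            for f in feats:
                self.partitions[p][f].add(doc_id)

    def search(self, text):
        feats = features(text)
        shared = defaultdict(set)
        # Query only the partitions that the query document routes to.
        for p in route(feats, self.num_partitions, self.fanout):
            for f in feats:
                for doc_id in self.partitions[p].get(f, ()):
                    shared[doc_id].add(f)
        # Rank candidates by the number of distinct shared features.
        ranked = [(doc_id, len(fs)) for doc_id, fs in shared.items()]
        return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

In this sketch, a query identical to an ingested document always reaches the partitions that hold it, because both calls to `route` see the same feature set. Whether a merely similar query reaches them depends on the overlap of their smallest hashes, which is the precision and recall trade-off the abstract refers to.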
Pages: 105-112
Page count: 8