Similarity search in sets and categorical data using the signature tree

Cited by: 12
Authors
Mamoulis, N [1]
Cheung, DW [1]
Lian, W [1]
Affiliation
[1] Univ Hong Kong, Dept Comp Sci & Informat Syst, Hong Kong, Hong Kong, Peoples R China
Source
19TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS | 2003
DOI
10.1109/ICDE.2003.1260783
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data mining applications analyze large collections of set data and high-dimensional categorical data. Search on these data types is not restricted to the classic problems of mining association rules and classification; similarity search is also a frequently applied operation. Access methods for multidimensional numerical data are inappropriate for this problem, and specialized indexes are needed. We propose a method that represents set data as bitmaps (signatures) and organizes them into a hierarchical index, suitable for similarity search and other related query types. In contrast to a previous technique, the signature tree is dynamic and does not rely on hardwired constants. Experiments with synthetic and real datasets show that it is robust to different data characteristics, scalable to the database size, and efficient for various queries.
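As a rough illustration of the idea in the abstract, the sketch below encodes sets as bitmap signatures and shows how a node signature in a hierarchical index could be used to prune a similarity search. The details are assumptions, since the abstract does not state them: one bit per item, Hamming distance as the dissimilarity measure, and a node signature formed as the bitwise OR of the signatures beneath it. The function names are hypothetical and are not taken from the paper.

```python
from typing import Iterable


def signature(items: Iterable[int], universe_size: int) -> int:
    """Encode a set of item ids as a bitmap stored in a Python int.

    Assumption: one bit per item id in the range [0, universe_size).
    """
    sig = 0
    for item in items:
        if not 0 <= item < universe_size:
            raise ValueError(f"item {item} is outside the universe")
        sig |= 1 << item
    return sig


def hamming(a: int, b: int) -> int:
    """Number of items that belong to exactly one of the two sets."""
    return bin(a ^ b).count("1")


def cover(signatures: Iterable[int]) -> int:
    """Node signature: bitwise OR of all signatures stored under the node."""
    node_sig = 0
    for sig in signatures:
        node_sig |= sig
    return node_sig


def min_hamming_to_cover(query: int, node_sig: int) -> int:
    """Lower bound on the Hamming distance from the query to any set under the node.

    Every set under the node only has bits that are set in node_sig, so each
    query bit outside node_sig must differ from every such set.
    """
    return bin(query & ~node_sig).count("1")


a = signature({0, 1, 2}, universe_size=8)
b = signature({1, 2, 3}, universe_size=8)
q = signature({2, 3, 7}, universe_size=8)
node = cover([a, b])

# The bound never exceeds the true distance to any set under the node,
# so a subtree whose bound is above the current best distance can be skipped.
bound = min_hamming_to_cover(q, node)
print(bound, hamming(q, a), hamming(q, b))
```

In this sketch, a search would visit a subtree only when its bound is no greater than the best distance found so far; how the actual signature tree chooses entries, splits nodes, and orders the search is not described in the abstract.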
Pages: 75-86
Number of pages: 12