Similarity search in sets and categorical data using the signature tree

Cited by: 12
Authors
Mamoulis, N [1]
Cheung, DW [1]
Lian, W [1]
Affiliation
[1] Univ Hong Kong, Dept Comp Sci & Informat Syst, Hong Kong, Hong Kong, Peoples R China
Source
19TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS | 2003
DOI
10.1109/ICDE.2003.1260783
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Data mining applications analyze large collections of set data and high-dimensional categorical data. Search on these data types is not restricted to the classic problems of mining association rules and classification, but similarity search is also a frequently applied operation. Access methods for multidimensional numerical data are inappropriate for this problem and specialized indexes are needed. We propose a method that represents set data as bitmaps (signatures) and organizes them into a hierarchical index, suitable for similarity search and other related query types. In contrast to a previous technique, the signature tree is dynamic and does not rely on hardwired constants. Experiments with synthetic and real datasets show that it is robust to different data characteristics, scalable to the database size and efficient for various queries.
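
To make the idea of set signatures concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes one bit per item, Hamming distance as the similarity measure, and a parent signature formed as the bitwise OR of its children's signatures; the abstract does not spell out any of these details. All function names (`make_signature`, `hamming`, `cover`, `min_hamming`) are hypothetical.

```python
from typing import Iterable


def make_signature(items: Iterable[int]) -> int:
    """Encode a set of non-negative item ids as a bitmap (assumed: one bit per item)."""
    sig = 0
    for item in items:
        sig |= 1 << item
    return sig


def hamming(a: int, b: int) -> int:
    """Number of bit positions where two signatures differ."""
    return bin(a ^ b).count("1")


def cover(signatures: Iterable[int]) -> int:
    """Assumed parent signature: bitwise OR of all child signatures."""
    out = 0
    for s in signatures:
        out |= s
    return out


def min_hamming(query: int, node_sig: int) -> int:
    """Lower bound on the Hamming distance from the query to any set under a node.

    Every set t under the node satisfies t <= node_sig bitwise, so each query
    bit missing from node_sig is also missing from t and costs at least 1.
    """
    return bin(query & ~node_sig).count("1")


if __name__ == "__main__":
    data = [make_signature({0, 2, 5}), make_signature({2, 5, 7})]
    query = make_signature({0, 2, 3})
    node = cover(data)
    print("lower bound for node:", min_hamming(query, node))
    print("exact distances:", [hamming(query, s) for s in data])
```

Under these assumptions, a search could skip any subtree whose lower bound already exceeds the best distance found so far, which is what makes a hierarchy of signatures useful for similarity queries.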
Pages: 75-86
Page count: 12