Dimensionality reduction for similarity search with the Euclidean distance in high-dimensional applications

被引:7
作者
Jeong, Seungdo [1 ]
Kim, Sang-Wook [1 ]
Choi, Byung-Uk [1 ]
机构
[1] Hanyang Univ, Dept Elect & Comp Engn, Sch Informat & Commun, Seoul 133791, South Korea
关键词
Multimedia information retrieval; High-dimensional indexing; Dimensionality reduction; Similarity search; QUERIES; INDEX;
D O I
10.1007/s11042-008-0243-y
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
In multimedia information retrieval, multimedia data are represented as vectors in high-dimensional space. To search these vectors efficiently, a variety of indexing methods have been proposed. However, the performance of these indexing methods degrades dramatically with increasing dimensionality, which is known as the dimensionality curse. To resolve the dimensionality curse, dimensionality reduction methods have been proposed. They map feature vectors in high-dimensional space into vectors in low-dimensional space before the data are indexed. This paper proposes a novel method for dimensionality reduction based on a function that approximates the Euclidean distance based on the norm and angle components of a vector. First, we identify the causes of, and discuss basic solutions to, errors in angle approximation during the approximation of the Euclidean distance. Then, this paper propose a new method for dimensionality reduction that extracts a set of subvectors from a feature vector and maintains only the norm and the approximated angle for every subvector. The selection of a good reference vector is crucial for accurate approximation of the angle component. We present criteria for being a good reference vector, and propose a method that chooses a good reference vector. Also, we define a novel distance function using the norm and angle components, and formally prove that the distance function consistently lower-bounds the Euclidean distance. This implies information retrieval with this function does not incur any false dismissals. Finally, the superiority of the proposed approach is verified via extensive experiments with synthetic and real-life data sets.
引用
收藏
页码:251 / 271
页数:21
相关论文
共 34 条
[1]  
Aggarwal C.C., 2001, P ACM SIGACT SIGMOD, P256
[2]  
Agrawal R., 1993, Proceedings of the International Conference on Foundations of Data Organization and Algorithms, Chicago, IL, P69
[3]  
[Anonymous], P ACM SIGMOD C
[4]  
[Anonymous], P ACM INT C MAN DAT
[5]  
BECKMANN N, 1990, SIGMOD REC, V19, P322, DOI 10.1145/93605.98741
[6]  
BERCHTOLD S, 1997, P ACM SIGMOD INT C M, P1
[7]  
Beyer K., 1999, P 7 INT C DAT THEOR, P217, DOI DOI 10.1007/3-540-49257-7_15
[8]   Searching in high-dimensional spaces -: Index structures for improving the performance of multimedia Databases [J].
Böhm, C ;
Berchtold, S ;
Keim, D .
ACM COMPUTING SURVEYS, 2001, 33 (03) :322-373
[9]  
Ciaccia P, 1997, PROCEEDINGS OF THE TWENTY-THIRD INTERNATIONAL CONFERENCE ON VERY LARGE DATABASES, P426
[10]   Dimensionality reduction and similarity computation by inner-product approximations [J].
Egecioglu, Ö ;
Ferhatosmanoglu, H ;
Ogras, U .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2004, 16 (06) :714-726