Convolutional Embedding for Edit Distance

被引:16
作者
Dai, Xinyan [1 ]
Yan, Xiao [1 ]
Zhou, Kaiwen [1 ]
Wang, Yuxuan [1 ]
Yang, Han [1 ]
Cheng, James [1 ]
机构
[1] Chinese Univ Hong Kong, Hong Kong, Peoples R China
来源
PROCEEDINGS OF THE 43RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '20) | 2020年
关键词
Edit distance; string similarity search; convolutional neural network; metric embedding; STRING SIMILARITY JOINS; PRODUCT QUANTIZATION; ALGORITHM;
D O I
10.1145/3397271.3401045
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Edit-distance-based string similarity search has many applications such as spell correction, data de-duplication, and sequence alignment. However, computing edit distance is known to have high complexity, which makes string similarity search challenging for large datasets. In this paper, we propose a deep learning pipeline (called CNN-ED) that embeds edit distance into Euclidean distance for fast approximate similarity search. A convolutional neural network (CNN) is used to generate fixed-length vector embeddings for a dataset of strings and the loss function is a combination of the triplet loss and the approximation error. To justify our choice of using CNN instead of other structures (e.g., RNN) as the model, theoretical analysis is conducted to show that some basic operations in our CNN model preserve edit distance. Experimental results show that CNN-ED outperforms data-independent CGK embedding and RNN-based GRU embedding in terms of both accuracy and efficiency by a large margin. We also show that string similarity search can be significantly accelerated using CNN-based embeddings, sometimes by orders of magnitude.
引用
收藏
页码:599 / 608
页数:10
相关论文
共 32 条
[11]  
Deng D, 2014, PROC INT CONF DATA, P340, DOI 10.1109/ICDE.2014.6816663
[12]   Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph [J].
Fu, Cong ;
Xiang, Chao ;
Wang, Changxu ;
Cai, Deng .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2019, 12 (05) :461-474
[13]   Optimized Product Quantization for Approximate Nearest Neighbor Search [J].
Ge, Tiezheng ;
He, Kaiming ;
Ke, Qifa ;
Sun, Jian .
2013 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2013, :2946-2953
[14]   Product Quantization for Nearest Neighbor Search [J].
Jegou, Herve ;
Douze, Matthijs ;
Schmid, Cordelia .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2011, 33 (01) :117-128
[15]   String Similarity Joins: An Experimental Evaluation [J].
Jiang, Yu ;
Li, Guoliang ;
Feng, Jianhua ;
Li, Wen-Syan .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2014, 7 (08) :625-636
[16]  
Li C., 2007, VLDB, P303
[17]   Pass-Join: A Partition-based Method for Similarity Joins [J].
Li, Guoliang ;
Deng, Dong ;
Wang, Jiannan ;
Feng, Jianhua .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2011, 5 (03) :253-264
[18]   Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs [J].
Malkov, Yu A. ;
Yashunin, D. A. .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2020, 42 (04) :824-836
[19]   A FASTER ALGORITHM COMPUTING STRING EDIT DISTANCES [J].
MASEK, WJ ;
PATERSON, MS .
JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 1980, 20 (01) :18-31
[20]   Low distortion embeddings for edit distance [J].
Ostrovsky, Rafail ;
Rabani, Yuval .
JOURNAL OF THE ACM, 2007, 54 (05)