Refining High-frequency-queries-based Filter for Similarity Join

被引:0
作者
Chongstitvatana, Jaruloj [1 ]
Thitinanrungkit, Natthee [1 ]
机构
[1] Chulalongkorn Univ, Fac Sci, Dept Math & Comp Sci, Bangkok 10330, Thailand
来源
2015 INTERNATIONAL COMPUTER SCIENCE AND ENGINEERING CONFERENCE (ICSEC) | 2015年
关键词
similarity join; filter-and-verfication approach; high-frequency queries;
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Similarity join and similarity search are important for text databases and data cleaning. Filter-and-verification are applied to reduce the processing time for similarity join and similarity search. High-frequency-queries-based filter partitions a dataset according to the similarity between a data record and a chosen high-frequency-query, and these partitions are stored in a similarity table. In the filter process, data in some rows of a similarity table are selected as candidates. Many high-frequency queries can be used to improve the pruning power. However, the time to choose an appropriate high-frequency query - i.e. to choose an appropriate similarity table - increases with the number of high-frequency queries. This paper proposes a refinement of high-frequency-queries-based filter to reduce the filter time and the number of candidates. To reduce the filter time, inverted lists of high-frequency queries are used to speed up the token counting, which reduces the time for choosing an appropriate similarity table. Binary search in each rows of a similarity table is applied to further eliminate non-candidates. It is shown from the experiments that the refined filter method takes less time and gives better pruning power than the original method.
引用
收藏
页码:68 / 72
页数:5
相关论文
共 7 条
  • [1] [Anonymous], 2012, P 2012 INT C MANAGEM, DOI DOI 10.1145/2213836.2213847
  • [2] Augsten N., 2013, Synth. Lect. Data Manage., V5, P1
  • [3] Chaudhuri Surajit, 2006, 22 INT C DAT ENG ICD, V5, DOI [10.1109/ICDE.2006.9, DOI 10.1109/ICDE.2006.9]
  • [4] Hadjieleftheriou M., 2011, Foundations and Trends in Databases, V2, P267
  • [5] Kunanusont Kamolwan, 2014, 2014 International Computer Science and Engineering Conference (ICSEC), P415, DOI 10.1109/ICSEC.2014.6978233
  • [6] Kunanusont K., 2015, P 12 INT C EL ENG EL
  • [7] Xiao C., 2011, P 17 INT C WORLD WID, P131