Efficient processing of similarity search on uncertain set-valued data

被引:0
作者
Chen, Ke [1 ]
Hong, Yin-Jie [1 ]
Chen, Gang [1 ]
机构
[1] Department of Computer Science and Technology, Zhejiang University
来源
Ruan Jian Xue Bao/Journal of Software | 2012年 / 23卷 / 06期
关键词
Dynamic programming; Expected similarity; Similarity search; Uncertain set;
D O I
10.3724/SP.J.1001.2012.04110
中图分类号
学科分类号
摘要
Setting similarity search on possible worlds is semantically and computationally different from the traditional technique for sets of certain data. Considering the uncertainty of items of set, i.e. there is a certain probability for an item appearing in a set, the traditional technique used for processing sets is not applicable. This paper brings forward the formulas to measure the expected similarity of the sets based on possible worlds' semantics. In the expected contexts, if the expected similarity of a pair of sets (X, Y) is larger than a given threshold value τ∈(0, 1), this pair could be called as similar set pair. In the normal algorithm, the complexity of the expected similarity of uncertain sets based on possible worlds is of exponential order. This paper has provided new algorithms to calculate expected similarity by dynamic programming. The complexity of these algorithms is of polynomial order and they reduce execution time greatly. The final experiments have indicated the usability and the high performance of the new algorithms. © 2012 ISCAS.
引用
收藏
页码:1588 / 1601
页数:13
相关论文
共 16 条
  • [1] Arasu A., Ganti V., Kaushik R., Efficient exact set-similarity joins, Proc. of the VLDB, pp. 918-929, (2006)
  • [2] Theobald M., Siddharth J., Paepcke A., SpotSigs: Robust and efficient near duplicate detection in large Web collections, Proc. of the SIGIR, pp. 563-570, (2008)
  • [3] Bayardo R.J., Ma Y.M., Srikant R., Scaling up all pairs similarity search, Proc. of the WWW, pp. 131-140, (2007)
  • [4] Chaudhuri S., Ganti V., Kaushik R., A primitive operator for similarity joins in data cleaning, Proc. of the ICDE, pp. 5-17, (2006)
  • [5] Elmagarmid A.K., Ipeirotis P.G., Verykios V.S., Duplicate record detection: A survey, IEEE Trans. on Knowledge and Data Engineering, 19, 1, pp. 1-16, (2007)
  • [6] Sarawagi S., Kirpal A., Efficient set joins on similarity predicates, Proc. of the SIGMOD Conf., pp. 743-754, (2004)
  • [7] Xiao C., Wang W., Lin X.M., Yu J.X., Efficient similarity joins for near duplicate detection, Proc. of the WWW, pp. 131-140, (2008)
  • [8] Gao Q.S., Gao X.Y., Hu Y., A uniform definition of fuzzy set theory and fundamental of probability theory, Journal of Dalian University of Technology, 46, 1, pp. 141-150, (2006)
  • [9] Soliman M.A., Ilyas I.F., Chang K.C., Top-k query processing in uncertain databases, Proc. of the ICDE, pp. 896-905, (2007)
  • [10] Bernecker T., Kriegel H.P., Renz M., Verhein F., Zufle A., Probabilistic frequent itemset mining in uncertain databases, Proc. of the KDD, pp. 119-128, (2009)