Efficient Top-k Similarity Join of Massive Time Series Using MapReduce

被引:0
作者
Chen, Dehua [1 ]
Shen, Changgan [1 ]
Li, Yue [1 ]
Le, Jiajin [1 ]
Rong, Chunming [2 ]
机构
[1] Donghua Univ, Sch Comp Sci & Technol, Shanghai, Peoples R China
[2] Univ Stavanger, Dept Elect Engn & Comp Sci, Stavanger, Norway
来源
JOURNAL OF INTERNET TECHNOLOGY | 2014年 / 15卷 / 06期
关键词
Massive time series; Top-k similarity join; MapReduce; Parallel computation;
D O I
10.6138/JIT.2014.15.6.13
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Top-k similarity join of time series, designed to find top-k most similar pairs of time series records, is a: primitive operation widely adopted by many time series data analysis applications. However, computing such top-k similarity join is a challenging problem today, as many modern applications are creating massive amounts of time series data. Obviously, a centralized machine is difficult to perform top-k similarity join in a large time series database efficiently. In this paper, we investigate how to perform the top-k similarity join of massive time series in parallel using MapReduce over a large cluster of commodity machines. Our proposed MapReduce-based algorithm consists of four steps, which takes as input a set of time series records and output an ordered list of top k closest pairs. To improve the efficiency in computing top-k similarity join, we proposed several solutions. We first introduce an efficient distance function based on LSH (Locality Sensitive Hash) for time series to improve the efficiency in pairwise similarity comparison. We next propose all pair partitioning methods to minimize the amount of data transfers between map and reduce functions. Moreover, we make use of serial computation strategy for parallelizing the computation of local top-k closest pairs in each partition. Our performance study confirms the effectiveness and scalability of our MapReduce algorithms.
引用
收藏
页码:1025 / 1032
页数:8
相关论文
共 28 条
[1]  
Agrawal R., 1993, Proceedings of the International Conference on Foundations of Data Organization and Algorithms, Chicago, IL, P69
[2]  
Andoni Alexander, 2011, NEAR OPTIMAL HASHING, V51, P117
[3]  
Berchtold S, 1996, PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON VERY LARGE DATA BASES, P28
[4]  
Berndt D.J., 1996, FINDING PATTERNS TIM, P229
[5]   Efficient processing of spatial joins using R-trees [J].
Brinkhoff, Thomas ;
Kriegel, Hans-Peter ;
Seeger, Bernhard .
SIGMOD Record, 1993, 22 (02) :237-246
[6]  
Chen L, 2004, P 30 INT C VER LARG, V30, P792, DOI [DOI 10.1016/B978-012088469-8.50070-X, 10.5555/1316689.1316758, DOI 10.5555/1316689.1316758]
[7]  
Cranor Chuck., 2003, ACM SIGMOD
[8]  
Datar M., 2004, PROC 20 ANN S COMPUT, P253
[9]  
Dean J, 2004, USENIX ASSOCIATION PROCEEDINGS OF THE SIXTH SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDE '04), P137
[10]   PARALLEL DATABASE-SYSTEMS - THE FUTURE OF HIGH-PERFORMANCE DATABASE-SYSTEMS [J].
DEWITT, D ;
GRAY, J .
COMMUNICATIONS OF THE ACM, 1992, 35 (06) :85-98