Neighbor Profile: Bagging Nearest Neighbors for Unsupervised Time Series Mining

被引:8
作者
He, Yuanduo [1 ,2 ]
Chu, Xu [1 ,3 ]
Wang, Yasha [1 ,2 ]
机构
[1] Minist Educ, Key Lab High Confidence Software Technol, Beijing 100871, Peoples R China
[2] Peking Univ, Natl Engn Res Ctr Software Engn, Beijing 100871, Peoples R China
[3] Peking Univ, Sch Elect Engn & Comp Sci, Beijing 100871, Peoples R China
来源
2020 IEEE 36TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2020) | 2020年
基金
中国国家自然科学基金;
关键词
Time series; motif; discord; nearest neighbor; unsupervised mining; ALGORITHMS;
D O I
10.1109/ICDE48307.2020.00039
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Unsupervised time series mining has been attracting great interest from both academic and industrial communities. As the two most basic data mining tasks, the discoveries of frequent/rare subsequences have been extensively studied in the literature. Specifically, frequent/rare subsequences are defined as the ones with the smallest/largest 1-nearest neighbor distance, which are also known as motif/discord. However, discord fails to identify rare subsequences when it occurs more than once in the time series, which is widely known as the twin freak problem. This problem is just the "tip of the iceberg" due to the 1-nearest neighbor distance based definitions. In this work, we for the first time provide a clear theoretical analysis of motif/discord as the 1-nearest neighbor based nonparametric density estimation of subsequence. Particularly, we focus on matrix profile, a recently proposed mining framework, which unifies the discovery of motif and discord under the same computing model. Thereafter, we point out the inherent three issues: low-quality density estimation, gravity defiant behavior, and lack of reusable model, which deteriorate the performance of matrix profile in both efficiency and subsequence quality. To overcome these issues, we propose Neighbor Profile to robustly model the subsequence density by bagging nearest neighbors for the discovery of frequent/rare subsequences. Specifically, we leverage multiple subsamples and average the density estimations from subsamples using adjusted nearest neighbor distances, which not only enhances the estimation robustness but also realizes a reusable model for efficient learning. We check the sanity of neighbor profile on synthetic data and further evaluate it on real-world datasets. The experimental results demonstrate that neighbor profile can correctly model the subsequences of different densities and shows superior performance significantly over matrix profile on the real-world arrhythmia dataset. Also, it is shown that neighbor profile is efficient for massive datasets.
引用
收藏
页码:373 / 384
页数:12
相关论文
共 43 条
[1]  
Aggarwal C.C., 2015, ACM SIGKDD EXPLORATI, V17, P24
[2]  
Angiulli F., 2002, Principles of Data Mining and Knowledge Discovery. 6th European Conference, PKDD 2002. Proceedings (Lecture Notes in Artificial Intelligence Vol.2431), P15
[3]  
[Anonymous], 2012, AM ACAD SLEEP MED
[4]  
[Anonymous], 2018, Density estimation for statistics and data analysis, DOI DOI 10.1201/9781315140919
[5]  
[Anonymous], 2000, Pattern Classification, DOI DOI 10.1007/978-3-319-57027-3_4
[6]  
[Anonymous], 2005, P 5 IEEE INT C DAT M
[7]   Isolation-based anomaly detection using nearest-neighbor ensembles [J].
Bandaragoda, Tharindu R. ;
Ting, Kai Ming ;
Albrecht, David ;
Liu, Fei Tony ;
Zhu, Ye ;
Wells, Jonathan R. .
COMPUTATIONAL INTELLIGENCE, 2018, 34 (04) :968-998
[8]  
Borgwardt K, 2013, ADV NEURAL INFORM PR, P467
[9]  
Breiman L, 1996, MACH LEARN, V24, P123, DOI 10.1007/BF00058655
[10]   LOF: Identifying density-based local outliers [J].
Breunig, MM ;
Kriegel, HP ;
Ng, RT ;
Sander, J .
SIGMOD RECORD, 2000, 29 (02) :93-104