Clustering of time-series subsequences is meaningless: implications for previous and future research

Cited by: 225

Authors:

Keogh, E [1]

Lin, J [1]

Affiliations:

[1] Univ Calif Riverside, Dept Comp Sci & Engn, Riverside, CA 92521 USA

Source:

Knowledge and Information Systems | 2005 / Vol. 8 / No. 2

Keywords:

clustering; data mining; rule discovery; subsequence; time series

DOI:

10.1007/s10115-004-0172-7

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Given the recent explosion of interest in streaming data and online algorithms, clustering of time-series subsequences, extracted via a sliding window, has received much attention. In this work, we make a surprising claim. Clustering of time-series subsequences is meaningless. More concretely, clusters extracted from these time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature. We can justify calling our claim surprising because it invalidates the contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative examples, and a comprehensive set of experiments on reimplementations of previous work. Although the primary contribution of our work is to draw attention to the fact that an apparent solution to an important problem is incorrect and should no longer be used, we also introduce a novel method that, based on the concept of time-series motifs, is able to meaningfully cluster subsequences on some time-series datasets.
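The sliding-window extraction step mentioned in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the input is a one-dimensional numeric series, and the helper name `sliding_window_subsequences` and the window length `w` are hypothetical, not taken from the paper.

```python
import numpy as np

def sliding_window_subsequences(series, w):
    """Return all length-w subsequences of a 1-D series as rows of a 2-D array.

    Hypothetical helper for illustration; not the authors' code.
    """
    series = np.asarray(series, dtype=float)
    n = series.size
    if w < 1 or w > n:
        raise ValueError("window length must be between 1 and len(series)")
    # Row i is series[i : i + w]; consecutive rows overlap in w - 1 points.
    return np.stack([series[i:i + w] for i in range(n - w + 1)])

series = np.arange(6.0)            # toy series: 0, 1, 2, 3, 4, 5
subs = sliding_window_subsequences(series, 3)
print(subs.shape)                  # (4, 3)
```

Each row of `subs` is one subsequence. A matrix of this form is what a clustering algorithm would receive in the setting the abstract argues against.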


Pages: 154-177

Page count: 24

52 references in total

