Data smashing: uncovering lurking order in data

Cited by: 12
Authors
Chattopadhyay, Ishanu [1, 2]
Lipson, Hod [3, 4]
Affiliations
[1] Univ Chicago, Computat Inst, Chicago, IL 60637 USA
[2] Cornell Univ, Dept Comp Sci, Sch Mech & Aerosp Engn, Ithaca, NY 14853 USA
[3] Cornell Univ, Sch Mech & Aerosp Engn, Ithaca, NY USA
[4] Cornell Univ, Ithaca, NY USA
Keywords
feature-free classification; universal metric; probabilistic automata; SET
DOI
10.1098/rsif.2014.0826
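Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09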
Abstract
From automatic speech recognition to discovering unusual stars, underlying almost all automated discovery tasks is the ability to compare and contrast data streams with each other, to identify connections and spot outliers. Despite the prevalence of data, however, automated methods are not keeping pace. A key bottleneck is that most data comparison algorithms today rely on a human expert to specify what 'features' of the data are relevant for comparison. Here, we propose a new principle for estimating the similarity between the sources of arbitrary data streams, using neither domain knowledge nor learning. We demonstrate the application of this principle to the analysis of data from a number of real-world challenging problems, including the disambiguation of electro-encephalograph patterns pertaining to epileptic seizures, detection of anomalous cardiac activity from heart sound recordings and classification of astronomical objects from raw photometry. In all these cases and without access to any domain knowledge, we demonstrate performance on a par with the accuracy achieved by specialized algorithms and heuristics devised by domain experts. We suggest that data smashing principles may open the door to understanding increasingly complex observations, especially when experts do not know what to look for.
Pages: 11