Normalized Compression Distance of Multisets with Applications

被引:29
作者
Cohen, Andrew R. [1 ]
Vitanyi, Paul M. B. [2 ,3 ]
机构
[1] Drexel Univ, Dept Elect & Comp Engn, Philadelphia, PA 19104 USA
[2] Natl Res Ctr Math & Comp Sci, Amsterdam, Netherlands
[3] Univ Amsterdam, NL-1098 XG Amsterdam, Netherlands
基金
美国国家卫生研究院;
关键词
Normalized compression distance; multisets or multiples; pattern recognition; data mining; similarity; classification; Kolmogorov complexity; retinal progenitor cells; synthetic data; organelle transport; handwritten character recognition; INFORMATION DISTANCE; TRANSPORT;
D O I
10.1109/TPAMI.2014.2375175
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Pairwise normalized compression distance (NCD) is a parameter-free, feature-free, alignment-free, similarity metric based on compression. We propose an NCD of multisets that is also metric. Previously, attempts to obtain such an NCD failed. For classification purposes it is superior to the pairwise NCD in accuracy and implementation complexity. We cover the entire trajectory from theoretical underpinning to feasible practice. It is applied to biological (stem cell, organelle transport) and OCR classification questions that were earlier treated with the pairwise NCD. With the new method we achieved significantly better results. The theoretic foundation is Kolmogorov complexity.
引用
收藏
页码:1602 / 1614
页数:13
相关论文
共 33 条
  • [11] Regularized linear discriminant analysis and its application in microarrays
    Guo, Yaqian
    Hastie, Trevor
    Tibshirani, Robert
    [J]. BIOSTATISTICS, 2007, 8 (01) : 86 - 100
  • [12] Kamvar S. D., 2003, IJCAI'03, P561
  • [13] Keogh Eamonn, 2004, P 10 ACM SIGKDD INT, P206, DOI [10.1145/1014052.1014077, DOI 10.1145/1014052.1014077]
  • [14] Information theory-based software metrics and obfuscation
    Kirk, SR
    Jenkins, S
    [J]. JOURNAL OF SYSTEMS AND SOFTWARE, 2004, 72 (02) : 179 - 186
  • [15] Application of compression-based distance measures to protein sequence classification:: a methodological study
    Kocsor, A
    Kertész-Farkas, A
    Kaján, L
    Pongor, S
    [J]. BIOINFORMATICS, 2006, 22 (04) : 407 - 412
  • [16] 3 APPROACHES TO QUANTITATIVE DEFINITION OF INFORMATION
    KOLMOGOROV, AN
    [J]. INTERNATIONAL JOURNAL OF COMPUTER MATHEMATICS, 1968, 2 (02) : 157 - +
  • [17] Kraft L. G., THESIS MIT CAMBRIDGE
  • [18] Gradient-based learning applied to document recognition
    Lecun, Y
    Bottou, L
    Bengio, Y
    Haffner, P
    [J]. PROCEEDINGS OF THE IEEE, 1998, 86 (11) : 2278 - 2324
  • [19] Levin L. A., 1974, Problems of Information Transmission, V10, P206
  • [20] The similarity metric
    Li, M
    Chen, X
    Li, X
    Ma, B
    Vitányi, PMB
    [J]. IEEE TRANSACTIONS ON INFORMATION THEORY, 2004, 50 (12) : 3250 - 3264