On normalized compression distance and large malware: Towards a useful definition of normalized compression distance for the classification of large files

Cited by: 20
Authors
Borbely, Rebecca Schuller [1]
Affiliations
[1] CyberPoint Int, 621 E Pratt St, Suite 300, Baltimore, MD 21202 USA
Keywords
Clustering algorithms;
DOI
10.1007/s11416-015-0260-0
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Normalized Compression Distance (NCD) is a popular tool that uses compression algorithms to cluster and classify data in a wide range of applications. Existing discussions of NCD's theoretical merit rely on certain theoretical properties of compression algorithms. However, we demonstrate that many popular compression algorithms do not seem to satisfy these theoretical properties. We explore the relationship between some of these properties and file size, demonstrate that this theoretical problem is actually a practical problem for classifying malware with large file sizes, and propose some variants of NCD that mitigate this problem.
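For readers unfamiliar with the measure, the following is a minimal sketch of the standard NCD definition from "Clustering by compression" (Cilibrasi and Vitányi), NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. It is not the paper's code and does not implement the variants the paper proposes. The choice of zlib, the helper names `compressed_size` and `ncd`, and the synthetic random inputs are assumptions made for illustration. The sketch probes one property usually assumed of the compressor, that a file should be at distance close to 0 from itself, and how that can fail once the input exceeds DEFLATE's 32 KiB match window.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Standard NCD: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both inputs
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic, incompressible test data (an assumption for illustration only;
# real malware samples are not random bytes). A fixed seed keeps runs repeatable.
rng = random.Random(0)
small = rng.getrandbits(8 * 1_000).to_bytes(1_000, "little")          # about 1 KB
large = rng.getrandbits(8 * 1_000_000).to_bytes(1_000_000, "little")  # about 1 MB

# Distance of each input to itself; ideally this would be near 0 in both cases.
small_self = ncd(small, small)
large_self = ncd(large, large)
print(f"NCD(small, small) = {small_self:.3f}")
print(f"NCD(large, large) = {large_self:.3f}")
```

With zlib, the small input should score close to 0 against itself, while the large input should score close to 1, because the second copy lies farther back than the compressor can look. Other compressors have different window or block limits, so the size at which this happens will differ.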
Pages: 235-242
Page count: 8
Related papers
22 in total
[1]  
Bailey M, 2007, LECT NOTES COMPUT SC, V4637, P178
[2]  
Bloom C., PPMZ HIGH COMPRESSIO
[3]  
Cebrian M, 2005, COMMUN INF SYST, V5, P367
[4]   Shared information and program plagiarism detection [J].
Chen, X ;
Francia, B ;
Li, M ;
McKinnon, B ;
Seker, A .
IEEE TRANSACTIONS ON INFORMATION THEORY, 2004, 50 (07) :1545-1551
[5]   Clustering by compression [J].
Cilibrasi, R ;
Vitányi, PMB .
IEEE TRANSACTIONS ON INFORMATION THEORY, 2005, 51 (04) :1523-1545
[6]   Algorithmic clustering of music [J].
Cilibrasi, R ;
Vitányi, P ;
de Wolf, R .
PROCEEDINGS OF THE FOURTH INTERNATIONAL CONFERENCE ON WEB DELIVERING OF MUSIC, 2004, :110-117
[7]  
Cilibrasi R., COMPLEARN
[8]   NEAREST NEIGHBOR PATTERN CLASSIFICATION [J].
COVER, TM ;
HART, PE .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1967, 13 (01) :21-+
[9]  
Dandu Ravi Varma, 2008, Indian J Radiol Imaging, V18, P287, DOI 10.4103/0971-3026.43838
[10]  
Gailly J., ZLIB MASSIVELY SPIFF