Unknown malcode detection and the imbalance problem

Cited by: 34
Authors
Moskovitch R. [1]
Stopel D. [1]
Feher C. [1]
Nissim N. [1]
Japkowicz N. [2]
Elovici Y. [1]
Affiliations
[1] Deutsche Telekom Laboratories, Department of Information Systems Engineering, Ben Gurion University
[2] School of Information Technology and Engineering, University of Ottawa, Ottawa
Source
Journal in Computer Virology | 2009 / Vol. 5 / No. 4
Keywords
Malware - Text processing
DOI
10.1007/s11416-009-0122-8
Abstract
The recent growth in network usage has motivated the creation of new malicious code for various purposes. Today's signature-based antivirus tools are very accurate for known malicious code but cannot detect new malicious code. Recently, classification algorithms have been used successfully to detect unknown malicious code. However, these studies used test collections of limited size and the same malicious-to-benign file ratio in both the training and test sets, a situation that does not reflect real-life conditions. We present a methodology for the detection of unknown malicious code that examines concepts from text categorization, based on n-gram extraction from the binary code and feature selection. We performed an extensive evaluation on a test collection of more than 30,000 files, in which we investigated the class imbalance problem. In real-life scenarios, the proportion of malicious files is expected to be low, about 10% of the total files. For practical purposes, it is unclear what the corresponding percentage in the training set should be. Our results indicate that greater than 95% accuracy can be achieved using a training set with a malicious file content of less than 33.3%. © Springer-Verlag France 2009.
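As a rough illustration of the two ideas the abstract names, byte n-gram extraction and control of the malicious-file percentage in a data set, the following is a minimal sketch and not the authors' implementation. The n-gram length of 5, the normalization by the most frequent n-gram, and the helper names `byte_ngrams`, `term_frequencies`, and `sample_with_ratio` are all assumptions made for this example; the abstract does not specify them.

```python
from collections import Counter
import random


def byte_ngrams(data: bytes, n: int = 5) -> Counter:
    """Count overlapping byte n-grams in a binary blob (n=5 is an assumed default)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def term_frequencies(counts: Counter) -> dict:
    """Scale counts by the most frequent n-gram in the file (assumed normalization)."""
    if not counts:
        return {}
    peak = max(counts.values())
    return {gram: c / peak for gram, c in counts.items()}


def sample_with_ratio(malicious, benign, malicious_fraction, total, seed=0):
    """Draw `total` files so that `malicious_fraction` of them are malicious."""
    rng = random.Random(seed)
    n_mal = round(total * malicious_fraction)
    n_ben = total - n_mal
    if n_mal > len(malicious) or n_ben > len(benign):
        raise ValueError("not enough files for the requested ratio")
    return rng.sample(malicious, n_mal), rng.sample(benign, n_ben)
```

With helpers like these, one could build training sets at several malicious percentages and evaluate each against a test set fixed at about 10% malicious files, which is the kind of comparison the abstract describes.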
Pages: 295-308
Number of pages: 13
Related papers
32 in total
[1] Filiol E., Josse S., A statistical model for undecidable viral detection, J. Comput. Virol., 3, pp. 65-74, (2007)
[2] Filiol E., Malware pattern scanning schemes secure against black-box analysis, J. Comput. Virol., 2, pp. 35-50, (2006)
[3] Gryaznov D., Scanners of the year 2000: Heuristics, Proceedings of the 5th International Virus Bulletin, (1999)
[4] Schultz M., Eskin E., Zadok E., Stolfo S., Data mining methods for detection of new malicious executables, Proceedings of the IEEE Symposium on Security and Privacy, pp. 178-184, (2001)
[5] Abou-Assaleh T., Cercone N., Keselj V., Sweidan R., N-gram based detection of new malicious code, Proceedings of the 28th Annual International Computer Software and Applications Conference (COMPSAC'04), (2004)
[6] Kolter J.Z., Maloof M.A., Learning to detect malicious executables in the wild, Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 470-478, (2004)
[7] Mitchell T., Machine Learning, (1997)
[8] Henchiri O., Japkowicz N., A feature selection and evaluation scheme for computer virus detection, Proceedings of ICDM-2006, pp. 891-895, (2006)
[9] Reddy D., Pujari A., N-gram analysis for computer virus detection, J. Comput. Virol., 2, pp. 231-239, (2006)
[10] Kubat M., Matwin S., Addressing the curse of imbalanced data sets: One-sided sampling, Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179-186, (1997)