A detection framework for semantic code clones and obfuscated code

Cited by: 29
Authors
Sheneamer, Abdullah [1, 2]
Roy, Swarup [3, 4]
Kalita, Jugal [2 ]
Affiliations
[1] Jazan Univ, Fac Comp Sci & Informat Syst, Jazan 45142, Saudi Arabia
[2] Univ Colorado, Coll Engn & Appl Sci, Colorado Springs, CO 80918 USA
[3] Sikkim Univ, Dept Comp Applicat, Sikkim 737102, Gangtok, India
[4] North Eastern Hill Univ, Dept Informat Technol, Shillong 793022, Meghalaya, India
Keywords
Code obfuscation; Semantic code clones; Machine learning; Bytecode dependency graph; Program dependency graph; ACCURATE;
DOI
10.1016/j.eswa.2017.12.040
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Code obfuscation is a staple tool in malware creation, where code fragments are altered substantially to make them appear different from the original while keeping the semantics unaffected. Most obfuscated code detection methods use program structure as a signature for detecting unknown code. They usually ignore the most important feature, the semantics of the code, when matching two code fragments or programs for obfuscation. Obfuscated code detection is a special case of the semantic code clone detection task. We propose a framework for detecting both code obfuscation and code clones using machine learning. We use features extracted from Java bytecode dependency graphs (BDGs), program dependency graphs (PDGs) and abstract syntax trees (ASTs). BDGs and PDGs are two representations of the semantics, or meaning, of a Java program; ASTs capture the structural aspects of a program. We use several publicly available code clone and obfuscated code datasets to validate the effectiveness of our framework, and different assessment parameters to evaluate the detection quality of our proposed model. Experimental results are excellent when compared with contemporary obfuscated code and code clone detectors. Interestingly, we achieve 100% success in detecting obfuscated code based on recall, precision, and F1-score. When we compare our method with other methods across all obfuscation types, viz., contraction, expansion, loop transformation and renaming, our model appears to perform best. In the case of clone detection, our model achieves very high detection accuracy in comparison to other similar detectors. © 2017 Elsevier Ltd. All rights reserved.
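The abstract reports detection quality in terms of recall, precision, and F1-score. As a reading aid, the following minimal sketch shows how these three measures are conventionally computed for a binary pair classifier (1 = the pair is a clone or obfuscated variant, 0 = it is not). The function name and the label lists are hypothetical illustrations, not the authors' code or data.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, positive class = 1."""
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical labels for four code pairs (made up for illustration only).
perfect = precision_recall_f1([1, 0, 1, 1], [1, 0, 1, 1])
mixed = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

A score of 100% on all three measures, as in the `perfect` case, means the classifier produced no false positives and no false negatives on the evaluated pairs.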
Pages: 405-420
Number of pages: 16