Program Characterization Using Runtime Values and Its Application to Software Plagiarism Detection

Cited by: 32
Authors
Jhi, Yoon-Chan [1]
Jia, Xiaoqi [2]
Wang, Xinran [3]
Zhu, Sencun [4]
Liu, Peng [5]
Wu, Dinghao [5]
Affiliations
[1] Samsung SDS R&D Ctr, Seoul, South Korea
[2] Chinese Acad Sci, State Key Lab Informat Secur, Inst Informat Engn, Beijing 100193, Peoples R China
[3] Shape Secur, Mountain View, CA 94040 USA
[4] Penn State Univ, Dept Comp Sci & Engn, University Pk, PA 16802 USA
[5] Penn State Univ, Coll Informat Sci & Technol, University Pk, PA 16802 USA
Funding
US National Science Foundation; National Natural Science Foundation of China;
Keywords
Software plagiarism detection; dynamic code identification; CODE;
DOI
10.1109/TSE.2015.2418777
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
Illegal code reuse has become a serious threat to the software community. Identifying similar or identical code fragments becomes much more challenging in code theft cases, where plagiarizers can use various automated code transformation or obfuscation techniques to keep stolen code from being detected. Previous work in this field is largely limited in that (i) most approaches cannot handle advanced obfuscation techniques, and (ii) methods based on source code analysis are not practical, since the source code of suspicious programs typically cannot be obtained until strong evidence has been collected. Based on the observation that some critical runtime values of a program are hard to replace or eliminate with semantics-preserving transformation techniques, we introduce a novel approach to dynamic characterization of executable programs. Leveraging such invariant values, our technique is resilient to various control and data obfuscation techniques. We show how the values can be extracted and refined to expose the critical values and how we can apply this runtime property to help solve problems in software plagiarism detection. We have implemented a prototype with a dynamic taint analyzer atop a generic processor emulator. Our value-based plagiarism detection method (VaPD) uses longest-common-subsequence-based similarity measuring algorithms to check whether two code fragments belong to the same lineage. We evaluate our proposed method using a set of real-world automated obfuscators. Our experimental results show that the value-based method successfully discriminates 34 plagiarisms obfuscated by SandMark, plagiarisms heavily obfuscated by KlassMaster, programs obfuscated by Thicket, and executables obfuscated by Loco/Diablo.
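The abstract names longest common subsequence (LCS) based similarity over runtime values but gives no formula. The following is a minimal sketch of that general idea, not the authors' implementation: the function names `lcs_length` and `similarity`, the integer value sequences, and the choice to normalize by the length of the original program's sequence are all assumptions made for illustration.

```python
from typing import Sequence


def lcs_length(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching values extend the best subsequence ending before both.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping x or skipping y.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(original: Sequence[int], suspect: Sequence[int]) -> float:
    """Hypothetical score in [0, 1]: share of the original's values found,
    in order, in the suspect's sequence. The normalization is an assumption."""
    if not original:
        return 0.0
    return lcs_length(original, suspect) / len(original)


# Made-up value sequences: the suspect contains extra (inserted) values
# but preserves four of the five original values in order.
original_values = [0x10, 0x2A, 0x7F, 0x03, 0x99]
suspect_values = [0x55, 0x10, 0x2A, 0x00, 0x7F, 0x99, 0x01]
score = similarity(original_values, suspect_values)
```

Because a subsequence need not be contiguous, values inserted by an obfuscator between the retained ones do not lower the score in this sketch; only missing or reordered values do.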
Pages: 925-943
Page count: 19