Semantic-Based Representation Binary Clone Detection for Cross-Architectures in the Internet of Things

Cited by: 20
Authors
Luo, Zhenhao [1]
Wang, Baosheng [1]
Tang, Yong [1]
Xie, Wei [1]
Affiliations
[1] Natl Univ Def Technol, Coll Comp, Changsha 410073, Hunan, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2019 / Volume 9 / Issue 16
Funding
National Natural Science Foundation of China;
Keywords
binary clone detection; Semantic representation; cross-architectures; IoT devices; real-world vulnerabilities; CODE; SOFTWARE;
DOI
10.3390/app9163283
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Code reuse is widespread in software development as well as in Internet of Things (IoT) devices. However, code reuse introduces many problems, e.g., software plagiarism and known vulnerabilities. Solving these problems requires extensive manual reverse analysis. Fortunately, binary clone detection can reduce this manual work by matching reused code against known parts. However, many binary clone detection methods are not robust to different compiler optimization options and different architectures. Some clone detection methods can be applied across architectures, but they rely on manual features based on human prior knowledge to generate feature vectors for assembly functions, and they fail to consider the internal associations between features from a semantic perspective. To address this problem, we propose GeneDiff, a semantic-based representation approach to cross-architecture binary clone detection, and implement it as a prototype. GeneDiff uses a representation model based on natural language processing (NLP) to generate a high-dimensional numeric vector for each function from its Valgrind intermediate representation (VEX). This is the first work that translates assembly instructions into an intermediate representation and uses a semantic representation model to implement cross-architecture clone detection. GeneDiff is robust to different compiler optimization options and different architectures. Compared to approaches that use symbolic execution, GeneDiff is significantly more efficient and accurate. The area under the curve (AUC) of the receiver operating characteristic (ROC) of GeneDiff reaches 92.35%, which is considerably higher than that of the approaches that use symbolic execution. Extensive experiments indicate that GeneDiff can detect similarity with high accuracy even when the code has been compiled with different optimization options and targeted to different architectures. We also use real-world IoT firmware across different architectures as targets, demonstrating that GeneDiff is practical for detecting known vulnerabilities.
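The abstract describes comparing per-function vectors and reporting a ROC AUC. As a minimal sketch of that evaluation idea only, the Python below scores pairs of function vectors and computes AUC by pairwise ranking. It is not GeneDiff's implementation: the use of cosine similarity, the three-dimensional toy vectors, the labels, and the helper names `cosine_similarity` and `roc_auc` are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def roc_auc(scores, labels):
    """AUC as the fraction of (clone, non-clone) pairs ranked correctly.

    A tie between a clone score and a non-clone score counts as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one clone and one non-clone pair")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical function vectors (toy values, not taken from the paper).
# Label 1 marks a pair assumed to be the same function built for two
# architectures; label 0 marks a pair of unrelated functions.
pairs = [
    ([0.9, 0.1, 0.3], [0.8, 0.2, 0.4], 1),
    ([0.2, 0.7, 0.1], [0.3, 0.6, 0.2], 1),
    ([0.9, 0.1, 0.3], [0.1, 0.9, 0.2], 0),
    ([0.2, 0.7, 0.1], [0.9, 0.0, 0.4], 0),
]

scores = [cosine_similarity(u, v) for u, v, _ in pairs]
labels = [y for _, _, y in pairs]
auc = roc_auc(scores, labels)
```

In this toy data every clone pair scores higher than every non-clone pair, so the ranking is perfect. Real function vectors would come from a trained representation model and would have far more dimensions.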
Pages: 21