DexBERT: Effective, Task-Agnostic and Fine-Grained Representation Learning of Android Bytecode

被引:6
作者
Sun, Tiezhu [1 ]
Allix, Kevin [1 ]
Kim, Kisub [2 ]
Zhou, Xin [2 ]
Kim, Dongsun [3 ]
Lo, David [2 ]
Bissyande, Tegawende F. [1 ]
Klein, Jacques [1 ]
机构
[1] Univ Luxembourg, L-1359 Kirchberg, Luxembourg
[2] Singapore Management Univ, Singapore 188065, Singapore
[3] Kyungpook Natl Univ, Daegu 41566, South Korea
基金
新加坡国家研究基金会;
关键词
Representation learning; Android app analysis; code representation; malicious code localization; defect prediction; MALWARE DETECTION; DEFECT PREDICTION; MALICIOUS CODE; PERFORMANCE; FRAMEWORK;
D O I
10.1109/TSE.2023.3310874
中图分类号
TP31 [计算机软件];
学科分类号
081202 ; 0835 ;
摘要
The automation of an increasingly large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). One foundational building block in the application of ML to software artifacts is the representation of these artifacts (e.g., source code or executable code) into a form that is suitable for learning. Traditionally, researchers and practitioners have relied on manually selected features, based on expert knowledge, for the task at hand. Such knowledge is sometimes imprecise and generally incomplete. To overcome this limitation, many studies have leveraged representation learning, delegating to ML itself the job of automatically devising suitable representations and selections of the most relevant features. Yet, in the context of Android problems, existing models are either limited to coarse-grained whole-app level (e.g., apk2vec) or conducted for one specific downstream task (e.g., smali2vec). Thus, the produced representation may turn out to be unsuitable for fine-grained tasks or cannot generalize beyond the task that they have been trained on. Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both of these two limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class-level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building Universal Language Models, such as BERT, whose goal is to capture abstract semantic information about sentences, in a way that is reusable for a variety of tasks. We propose DexBERT, a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications. We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model in three distinct classlevel software engineering tasks: Malicious Code Localization, Defect Prediction, and Component Type Classification. We also experiment with strategies to deal with the problem of catering to apps having vastly different sizes, and we demonstrate one example of using our technique to investigate what information is relevant to a given task.
引用
收藏
页码:4691 / 4706
页数:16
相关论文
共 93 条
[1]   A Survey of Machine Learning for Big Code and Naturalness [J].
Allamanis, Miltiadis ;
Barr, Earl T. ;
Devanbu, Premkumar ;
Sutton, Charles .
ACM COMPUTING SURVEYS, 2018, 51 (04)
[2]  
Allix K, 2016, 13TH WORKING CONFERENCE ON MINING SOFTWARE REPOSITORIES (MSR 2016), P468, DOI [10.1109/MSR.2016.056, 10.1145/2901739.2903508]
[3]   Empirical assessment of machine learning-based malware detectors for Android Measuring the gap between in-the-lab and in-the-wild validation scenarios [J].
Allix, Kevin ;
Bissyande, Tegawende F. ;
Jerome, Quentin ;
Klein, Jacques ;
State, Radu ;
Le Traon, Yves .
EMPIRICAL SOFTWARE ENGINEERING, 2016, 21 (01) :183-211
[4]   code2vec: Learning Distributed Representations of Code [J].
Alon, Uri ;
Zilberstein, Meital ;
Levy, Omer ;
Yahav, Eran .
PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL, 2019, 3 (POPL)
[5]   Drebin: Effective and Explainable Detection of Android Malware in Your Pocket [J].
Arp, Daniel ;
Spreitzenbarth, Michael ;
Huebner, Malte ;
Gascon, Hugo ;
Rieck, Konrad .
21ST ANNUAL NETWORK AND DISTRIBUTED SYSTEM SECURITY SYMPOSIUM (NDSS 2014), 2014,
[6]   AndroAnalyzer: android malicious software detection based on deep learning [J].
Arslan, Recep Sinan .
PEERJ COMPUTER SCIENCE, 2021,
[7]  
Atici MA, 2016, 2016 4TH INTERNATIONAL SYMPOSIUM ON DIGITAL FORENSIC AND SECURITY (ISDFS), P26, DOI 10.1109/ISDFS.2016.7473512
[8]  
Bamman D., 2013, arXiv
[9]  
Bhatia T, 2017, 2017 INTERNATIONAL CONFERENCE ON CYBER SECURITY AND PROTECTION OF DIGITAL SERVICES (CYBER SECURITY), DOI 10.1109/CyberSecPODS.2017.8074847
[10]  
Booz J, 2018, 2018 19TH IEEE/ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNPD), P140, DOI 10.1109/SNPD.2018.8441128