PDF Malware Detection based on Stacking Learning

Cited by: 19
Authors
Issakhani, Maryam [1]
Victor, Princy [1]
Tekeoglu, Ali [2]
Lashkari, Arash Habibi [1]
Affiliations
[1] Univ New Brunswick UNB, Canadian Inst Cybersecur CIC, Fredericton, NB, Canada
[2] Johns Hopkins Univ, Appl Phys Lab, Crit Infrastruct Protect Grp, Baltimore, MD 21218 USA
Source
PROCEEDINGS OF THE 8TH INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS SECURITY AND PRIVACY (ICISSP) | 2021
Keywords
PDF; PDF Malware; Evasive PDF Malware; Malware Detection; Stacking; Machine Learning; Deep Learning
DOI
10.5220/0010908400003120
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Over the years, the Portable Document Format (PDF) has become the most popular format for presenting content, owing to its flexibility and ease of use. However, advanced features such as JavaScript and file embedding make PDF files an attractive target for attackers. Because of the complex PDF structure and the sophistication of attacks, traditional detection approaches such as antivirus software, which rely on signature-based techniques, can detect only specific types of threats. Although state-of-the-art research uses AI to achieve higher PDF malware detection rates, evasive malicious PDF files remain a security threat. This paper proposes a framework to address this gap: it extracts 28 representative static features from PDF files, 12 of which are novel, and feeds them to stacking ML models to detect evasive malicious PDF files. We evaluated our solution on two datasets, Contagio and a newly generated evasive PDF dataset (Evasive-PDFMal2022). In the first evaluation, we achieved an accuracy of 99.89% and an F1-score of 99.86%, outperforming existing models. We then re-evaluated the proposed model on Evasive-PDFMal2022, an improved version of Contagio, and achieved an accuracy of 98.69% and an F1-score of 98.77%, demonstrating the effectiveness of the proposed model. A comparison with state-of-the-art methods shows that the proposed approach is more resilient in detecting evasive malicious PDF files.
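As a rough illustration of what "static features" can mean here, the sketch below counts a few well-known PDF name objects in the raw bytes of a file, without executing or rendering it. The keyword list, the function name extract_static_features, and the sample bytes are assumptions made for illustration only; the abstract does not enumerate the paper's 28 features, and this is not the authors' extractor.

```python
import re

# Hypothetical keyword list: PDF name objects that static analyzers commonly
# inspect. This is an illustrative assumption, NOT the paper's 28-feature set.
KEYWORDS = (b"/JavaScript", b"/JS", b"/OpenAction", b"/EmbeddedFile", b"/ObjStm")


def extract_static_features(data: bytes) -> dict:
    """Return simple count-based features computed from raw PDF bytes."""
    features = {
        "size_bytes": len(data),
        # A well-formed PDF starts with a "%PDF-" header.
        "has_pdf_header": data.startswith(b"%PDF-"),
        # "obj" as a standalone token marks the start of an indirect object;
        # the word boundary keeps "endobj" from being counted.
        "obj_count": len(re.findall(rb"\bobj\b", data)),
    }
    for kw in KEYWORDS:
        # Count the name only when it is not followed by another
        # alphanumeric character, so longer names are not miscounted.
        pattern = re.escape(kw) + rb"(?![A-Za-z0-9])"
        features[kw.decode("ascii")] = len(re.findall(pattern, data))
    return features


if __name__ == "__main__":
    # Tiny hand-written sample; a real file would be read with open(path, "rb").
    sample = (
        b"%PDF-1.4\n"
        b"1 0 obj\n<< /OpenAction 2 0 R >>\nendobj\n"
        b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
        b"%%EOF"
    )
    for name, value in extract_static_features(sample).items():
        print(name, value)
```

A feature dictionary of this kind could then be turned into a numeric vector and passed to a stacked ensemble; the choice of base learners and meta-learner is not stated in the abstract and is left open here.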
Pages: 562-570
Page count: 9