SFEM: Structural feature extraction methodology for the detection of malicious office documents using machine learning methods

Cited by: 55
Authors
Cohen, Aviad [1, 2]
Nissim, Nir [1, 2]
Rokach, Lior [1, 2]
Elovici, Yuval [1, 2]
Affiliations
[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel
[2] Ben Gurion Univ Negev, Cyber Secur Res Ctr, Malware Lab, IL-84105 Beer Sheva, Israel
Keywords
Machine learning; Malware detection; Static analysis; Structural features; Microsoft Office Open XML; Document; MALWARE DETECTION; PDF FILES; CLASSIFICATION
DOI
10.1016/j.eswa.2016.07.010
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Office documents are used extensively by individuals and organizations. Most users consider these documents safe for use. Unfortunately, Office documents can contain malicious components and perform harmful operations. Attackers increasingly take advantage of naive users and leverage Office documents in order to launch sophisticated advanced persistent threat (APT) and ransomware attacks. Recently, targeted cyber-attacks against organizations have been initiated with emails containing malicious attachments. Since most email servers do not allow the attachment of executable files to emails, attackers prefer to use non-executable files (e.g., documents) for malicious purposes. Existing anti-virus engines primarily use signature-based detection methods, and therefore fail to detect new unknown malicious code which has been embedded in an Office document. Machine learning methods have been shown to be effective at detecting known and unknown malware in various domains; however, to the best of our knowledge, machine learning methods have not been used for the detection of malicious XML-based Office documents (*.docx, *.xlsx, *.pptx, *.odt, *.ods, etc.). In this paper we present a novel structural feature extraction methodology (SFEM) for XML-based Office documents. SFEM extracts discriminative features from documents, based on their structure. We leveraged SFEM's features with machine learning algorithms for effective detection of malicious *.docx documents. We extensively evaluated SFEM with machine learning classifiers using a representative collection (16,938 *.docx documents collected "from the wild") which contains 4.9% malicious and approximately 95.1% benign documents. We examined 1,600 unique configurations based on different combinations of feature extraction, feature selection, feature representation, top-feature selection methods, and machine learning classifiers.
The results show that machine learning algorithms trained on features provided by SFEM successfully detect new unknown malicious *.docx documents. The Random Forest classifier achieves the highest detection rates, with an AUC of 99.12% and true positive rate (TPR) of 97% that is accompanied by a false positive rate (FPR) of 4.9%. In comparison, the best anti-virus engine achieves a TPR which is 25% lower. © 2016 Elsevier Ltd. All rights reserved.
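The abstract states only that SFEM derives features from a document's structure; it does not spell out the extraction procedure. As a minimal sketch of what a structure-based feature could look like, the Python below treats a *.docx file as the ZIP container it is and records every internal path (and each parent directory) as a binary feature. The function names, the toy member names, and the binary presence/absence representation are all assumptions made for illustration, not the authors' published method.

```python
import io
import zipfile


def structural_paths(docx_bytes: bytes) -> set[str]:
    """Collect every internal path of a ZIP-based document, plus each parent prefix.

    Assumption: "structure" is approximated by the container's path hierarchy.
    """
    features: set[str] = set()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            parts = name.strip("/").split("/")
            # Add "word", then "word/document.xml", and so on.
            for depth in range(1, len(parts) + 1):
                features.add("/".join(parts[:depth]))
    return features


def to_binary_vector(paths: set[str], vocabulary: list[str]) -> list[int]:
    """Encode presence (1) or absence (0) of each vocabulary path."""
    return [1 if path in paths else 0 for path in vocabulary]


def make_toy_docx(member_names: list[str]) -> bytes:
    """Build an in-memory ZIP with placeholder members (hypothetical test data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in member_names:
            archive.writestr(name, "<placeholder/>")
    return buffer.getvalue()


# Two toy documents that differ by one extra internal member.
plain = make_toy_docx(["[Content_Types].xml", "word/document.xml"])
extra = make_toy_docx(
    ["[Content_Types].xml", "word/document.xml", "word/vbaProject.bin"]
)

# The vocabulary is the sorted union of all paths seen across the collection.
vocabulary = sorted(structural_paths(plain) | structural_paths(extra))
plain_vector = to_binary_vector(structural_paths(plain), vocabulary)
extra_vector = to_binary_vector(structural_paths(extra), vocabulary)
```

Vectors of this kind could then be passed to a feature selection step and a classifier such as Random Forest, which is the type of pipeline the abstract evaluates; the specific selection and representation methods used in the paper are not given here.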
Pages: 324-343
Number of pages: 20
Related papers
50 in total
  • [31] Malicious File Detection Method Using Machine Learning and Interworking with MITRE ATT&CK Framework
    Ahn, Gwanghyun
    Kim, Kookjin
    Park, Wonhyung
    Shin, Dongkyoo
    APPLIED SCIENCES-BASEL, 2022, 12 (21)
  • [32] Malware Detection and Classification in Android Application Using Simhash-Based Feature Extraction and Machine Learning
    Al-Kahla, Wafaa
    Taqieddin, Eyad
    Shatnawi, Ahmed S.
    Al-Ouran, Rami
    IEEE ACCESS, 2024, 12 : 174255 - 174273
  • [33] Detection of Encrypted Malicious Network Traffic using Machine Learning
    De Lucia, Michael J.
    Cotton, Chase
    MILCOM 2019 - 2019 IEEE MILITARY COMMUNICATIONS CONFERENCE (MILCOM), 2019
  • [34] Malicious url detection using machine learning and ensemble modeling
    Pakhare P.S.
    Krishnan S.
    Charniya N.N.
    Lecture Notes on Data Engineering and Communications Technologies, 2021, 66 : 839 - 850
  • [35] Windower: Feature Extraction for Real-Time DDoS Detection Using Machine Learning
    Goldschmidt, Patrik
    Kucera, Jan
    PROCEEDINGS OF 2024 IEEE/IFIP NETWORK OPERATIONS AND MANAGEMENT SYMPOSIUM, NOMS 2024, 2024
  • [36] An Effective Feature Selection Algorithm for Machine Learning-based Malicious Traffic Detection
    Fei, Chao
    Xia, Nian
    Tsai, Pang-Wei
    Lu, Yang
    Pan, Xiaonan
    Gong, Junli
    2024 19TH ASIA JOINT CONFERENCE ON INFORMATION SECURITY, ASIAJCIS 2024, 2024, : 91 - 98
  • [37] Multifaceted ECG Feature Extraction for AFIB Detection: Using Traditional Machine Learning Techniques
    Nguyen, Tri M.
    Nguyen, Hien D.
    Hung Nguyen
    Xuan-Hau Pham
    Tran, Dung A.
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS, PT I, ACIIDS 2024, 2024, 14795 : 108 - 119
  • [38] Feature Extraction Evaluation of Various Machine Learning Methods for Finger Movement Classification using Double Myo Armband
    Anam, Khairul
    Ismail, Harun
    Hanggara, Faruq S.
    Avian, Cries
    Nahela, Safri
    Sasono, Muchamad Arif Hana
    JOURNAL OF ENGINEERING AND TECHNOLOGICAL SCIENCES, 2023, 55 (05): 587 - 599
  • [39] Improving Machine Learning Models for Malware Detection Using Embedded Feature Selection Method
    Chemmakha, Mohammed
    Habibi, Omar
    Lazaar, Mohamed
    IFAC PAPERSONLINE, 2022, 55 (12): 771 - 776
  • [40] An Effective Malware Detection Method Using Hybrid Feature Selection and Machine Learning Algorithms
    Namita Dabas
    Prachi Ahlawat
    Prabha Sharma
    Arabian Journal for Science and Engineering, 2023, 48 : 9749 - 9767