A new hybrid ensemble feature selection framework for machine learning-based phishing detection system

Cited by: 192
Authors
Chiew, Kang Leng [1]
Tan, Choon Lin [1]
Wong, KokSheik [2]
Yong, Kelvin S. C. [3]
Tiong, Wei King [1]
Affiliations
[1] Univ Malaysia Sarawak, Fac Comp Sci & Informat Technol, Kota Samarahan 94300, Sarawak, Malaysia
[2] Monash Univ Malaysia, Sch Informat Technol, Bandar Sunway 47500, Selangor, Malaysia
[3] Curtin Univ, Dept Elect & Comp Engn, Fac Engn & Sci, CDT 250, Miri 98009, Sarawak, Malaysia
Keywords
Phishing detection; Feature selection; Machine learning; Ensemble-based; Classification; Phishing dataset
DOI
10.1016/j.ins.2019.01.064
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper proposes a new feature selection framework for machine learning-based phishing detection systems, called the Hybrid Ensemble Feature Selection (HEFS). In the first phase of HEFS, a novel Cumulative Distribution Function gradient (CDF-g) algorithm is exploited to produce primary feature subsets, which are then fed into a data perturbation ensemble to yield secondary feature subsets. The second phase derives a set of baseline features from the secondary feature subsets by using a function perturbation ensemble. The overall experimental results suggest that HEFS performs best when it is integrated with a Random Forest classifier, where the baseline features correctly distinguish 94.6% of phishing and legitimate websites using only 20.8% of the original features. In another experiment, the baseline features (10 in total) utilised on Random Forest outperform the set of all features (48 in total) used on SVM, Naive Bayes, C4.5, JRip, and PART classifiers. HEFS also shows promising results when benchmarked using another well-known phishing dataset from the University of California Irvine (UCI) repository. Hence, HEFS is a highly desirable and practical feature selection technique for machine learning-based phishing detection systems. (C) 2019 Elsevier Inc. All rights reserved.
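The two-phase structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the CDF-g algorithm, the filter measures, or the aggregation rules, so the sketch assumes a fixed top-k cutoff in place of CDF-g, two hypothetical filter scores (`mean_gap` and `abs_correlation`), intersection across data partitions, and union across filter measures. The toy dataset and all function names are likewise made up for illustration.

```python
from statistics import mean, pstdev

def mean_gap(column, labels):
    """Hypothetical filter score: absolute difference between class means."""
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    return abs(mean(pos) - mean(neg))

def abs_correlation(column, labels):
    """Hypothetical filter score: absolute Pearson correlation with the label."""
    sx, sy = pstdev(column), pstdev(labels)
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no information
    mx, my = mean(column), mean(labels)
    cov = mean([(x - mx) * (y - my) for x, y in zip(column, labels)])
    return abs(cov / (sx * sy))

def top_k(rows, labels, score, k):
    """Primary subset: indices of the k best-scoring features.
    Assumption: a fixed top-k cutoff stands in for CDF-g here."""
    n_features = len(rows[0])
    scores = [score([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])

def data_perturbation(rows, labels, score, k, n_parts):
    """Secondary subset: one filter measure applied to several data partitions.
    Assumption: partitions are interleaved and results are intersected."""
    subsets = []
    for p in range(n_parts):
        idx = [i for i in range(len(rows)) if i % n_parts == p]
        subsets.append(
            top_k([rows[i] for i in idx], [labels[i] for i in idx], score, k)
        )
    return set.intersection(*subsets)

def function_perturbation(rows, labels, scores, k, n_parts):
    """Baseline features: secondary subsets from several filter measures.
    Assumption: the subsets are combined by union."""
    secondary = [data_perturbation(rows, labels, s, k, n_parts) for s in scores]
    return set.union(*secondary)

# Toy data: features 0 and 1 track the label, feature 2 is unrelated,
# feature 3 is constant.
labels = [i % 2 for i in range(12)]
rows = [
    [float(y), 2.0 * y + 0.1 * (i % 3), float((i // 2) % 2), 5.0]
    for i, y in enumerate(labels)
]

baseline = sorted(
    function_perturbation(rows, labels, [mean_gap, abs_correlation], k=2, n_parts=3)
)
print(baseline)
```

On this toy data the two informative features survive both ensembles while the unrelated and constant features are discarded. The choice of intersection for the data perturbation step and union for the function perturbation step is only one plausible combination; other voting rules would fit the same skeleton.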
Pages: 153-166
Page count: 14