Supervised feature selection using principal component analysis

Cited by: 14
Authors
Rahmat, Fariq [1]
Zulkafli, Zed [2]
Ishak, Asnor Juraiza [1]
Rahman, Ribhan Zafira Abdul [1]
De Stercke, Simon [3]
Buytaert, Wouter [3]
Tahir, Wardah [4]
Ab Rahman, Jamalludin [5]
Ibrahim, Salwa [6]
Ismail, Muhamad [6]
Affiliations
[1] Univ Putra Malaysia, Dept Elect & Elect Engn, Serdang 43400, Selangor, Malaysia
[2] Univ Putra Malaysia, Dept Civil Engn, Serdang 43400, Selangor, Malaysia
[3] Imperial Coll London, Dept Civil & Environm Engn, Skempton Bldg,South Kensington Campus, London SW7 2BX, England
[4] Univ Teknol MARA, Sch Civil Engn, Coll Engn, Shah Alam 40450, Selangor, Malaysia
[5] Int Islamic Univ Malaysia, Dept Community Med, Kulliyyah Med, Kuantan 25200, Pahang, Malaysia
[6] Minist Hlth, Negeri Sembilan State Hlth Dept, Seremban 70300, Negeri Sembilan, Malaysia
Keywords
Supervised feature selection; Feature selection; LASSO; Principal component analysis; ANN; LEPTOSPIROSIS; INFORMATION
DOI
10.1007/s10115-023-01993-5
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Principal component analysis (PCA) is widely used in computational science branches such as computer science, pattern recognition, and machine learning, as it can effectively reduce the dimensionality of high-dimensional data. In particular, it is a popular transformation method for feature extraction. In this study, we explore the use of PCA for feature selection in regression applications. We introduce a new PCA-based approach, called Targeted PCA, to analyze a multivariate dataset that includes the dependent variable: it identifies the principal component with a high representation of the dependent variable and then examines that component to capture and rank the contributions of the non-dependent variables. The study also compares the selected features with those resulting from a Least Absolute Shrinkage and Selection Operator (LASSO) regression. Finally, the selected features were tested in two regression models: multiple linear regression (MLR) and an artificial neural network (ANN). The results are presented for three datasets covering socioeconomic, environmental, and computer image processing applications. Our study found that for 2 of the 3 random datasets, the features selected by the PCA and LASSO regression methods have more than 50% similarity. In the regression predictions, the PCA-selected features resulted in little difference from the LASSO-selected features in terms of MLR prediction accuracy. However, the ANN regression demonstrated faster convergence and a greater reduction of error.
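The Targeted PCA idea described in the abstract can be sketched roughly as follows. This is only an illustrative reading of the abstract, not the authors' implementation: the function name `targeted_pca_rank`, the use of the correlation matrix, and the choice to measure "representation" by the scaled loading (eigenvector entry times the square root of the eigenvalue) are all assumptions, and the synthetic data is made up for the demonstration.

```python
import numpy as np

def targeted_pca_rank(X, y):
    """Rank the columns of X by their loading on the principal component
    that best represents y. Hypothetical helper; an assumed reading of the
    Targeted PCA procedure outlined in the abstract."""
    # Analyze predictors and the dependent variable together (y is the last column).
    Z = np.column_stack([X, y]).astype(float)

    # PCA on the correlation matrix, equivalent to PCA on standardized columns.
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)   # columns of eigvecs are components
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative round-off

    # Scaled loadings: correlation between each variable and each component score.
    loadings = eigvecs * np.sqrt(eigvals)

    # Assumption: the "targeted" component is the one on which y loads most strongly.
    target_pc = int(np.argmax(np.abs(loadings[-1, :])))

    # Rank the non-dependent variables by absolute loading on that component.
    contrib = np.abs(loadings[:-1, target_pc])
    order = np.argsort(contrib)[::-1]
    return order, contrib

# Synthetic example: y depends strongly on column 0 and weakly on column 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

order, contrib = targeted_pca_rank(X, y)
print("ranking of feature indices:", order)
print("absolute loadings:", np.round(contrib, 3))
```

In a comparison like the one the abstract reports, the top-ranked indices from such a ranking would be set against the features that receive non-zero coefficients in a LASSO fit on the same data.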
Pages: 1955-1995
Page count: 41