Selecting critical features for data classification based on machine learning methods

Cited by: 499
Authors
Chen, Rung-Ching [1 ]
Dewi, Christine [1 ,2 ]
Huang, Su-Wen [1 ,3 ]
Caraka, Rezzy Eko [1 ]
Affiliations
[1] Chaoyang Univ Technol, Dept Informat Management, 168 Jifong East Rd, Taichung 41349, Taiwan
[2] Satya Wacana Christian Univ, Fac Informat Technol, Salatiga 50711, Central Java, Indonesia
[3] Taichung Vet Gen Hosp Taiwan, Off Gen Affairs, 1650 Taiwan Blvd Sect 4, Taichung 40705, Taiwan
Keywords
Random Forest; Feature selection; SVM; Classification; KNN; LDA; Regression trees; Models; Algorithm
DOI
10.1186/s40537-020-00327-4
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline code
081202
Abstract
Feature selection is especially important for data sets with many variables and features. It eliminates unimportant variables and improves both the accuracy and the performance of classification. Random Forest has emerged as a particularly useful algorithm because it can handle feature selection even with a large number of variables. In this paper, we use three popular data sets with many variables (Bank Marketing, Car Evaluation Database, Human Activity Recognition Using Smartphones) to conduct the experiments. There are four main reasons why feature selection is essential: to simplify the model by reducing the number of parameters, to decrease training time, to reduce overfitting by enhancing generalization, and to avoid the curse of dimensionality. In addition, we evaluate and compare the accuracy and performance of several classification models: Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA); the model with the highest accuracy is taken as the best classifier. In practice, this paper adopts Random Forest to select the important features for classification. Our experiments present a comparative study of the RF algorithm from different perspectives. Furthermore, we compare the results on each data set with and without essential-feature selection by the RF methods varImp(), Boruta, and Recursive Feature Elimination (RFE) to obtain the best accuracy and kappa. Experimental results demonstrate that Random Forest achieves the best performance in all experiment groups.
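The workflow described in the abstract — rank features with a Random Forest, keep a strong subset via Recursive Feature Elimination, then compare classifier accuracy with and without selection — can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data, not the authors' exact pipeline (they use the R functions varImp(), Boruta, and caret's RFE); the dataset and parameter values here are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a data set with many variables, only a few informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Rank features by Random Forest impurity importance (analogous to R's varImp()).
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = rf.feature_importances_.argsort()[::-1]

# Recursive Feature Elimination: repeatedly drop the weakest feature.
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=5).fit(X, y)
selected = [i for i, keep in enumerate(rfe.support_) if keep]

# Compare cross-validated accuracy with all features vs. the selected subset.
acc_all = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
acc_sel = cross_val_score(RandomForestClassifier(random_state=0),
                          X[:, selected], y, cv=5).mean()
print(f"all features: {acc_all:.3f}, selected subset: {acc_sel:.3f}")
```

The same comparison is repeated in the paper for each classifier (RF, SVM, KNN, LDA), with accuracy and kappa as the evaluation metrics.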
Pages: 26