Toward feature selection in big data preprocessing based on hybrid cloud-based model

Cited by: 0

Authors:

Noha Shehab

Mahmoud Badawy

H Arafat Ali

Affiliations:

[1] Mansoura University, Computers and Control Systems Engineering Department, Faculty of Engineering

[2] Ministry of Communications and Information Technology, Information Technology Institute, Open Source Dept.

[3] Taibah University

[4] Computer Science and Information Dept.

Source:

The Journal of Supercomputing | 2022 / Vol. 78

Keywords:

Analysis; Big data; Classification; Cloud; Feature selection; Firefly; WKNN

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

Big data has recently attracted wide attention in fields such as machine learning, pattern recognition, medicine, finance, and transportation. Data analysis is crucial for converting data into more specific information that is fed to decision-making systems. With diverse and complex types of datasets, knowledge discovery becomes more difficult. One solution is feature subset selection, a preprocessing step that reduces this complexity so that computation and analysis become more tractable. Preprocessing produces a reliable and suitable source for any data-mining algorithm. Effective feature selection can improve a model's performance and help us understand the characteristics and underlying structure of complex data. This study introduces a novel hybrid cloud-based feature selection model for imbalanced data, based on the k-nearest neighbor algorithm. The proposed model combines the firefly distance metric with the Euclidean distance used in k-nearest neighbor. Compared with the simple weighted nearest neighbor algorithm, the experimental results showed favorable time usage and feature weights, and classification accuracy improved by 12%. Using the cloud-distributed model reduced the processing time by up to 30%, which is considered substantial compared with recent state-of-the-art methods.
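For readers unfamiliar with the weighted nearest neighbor baseline mentioned in the abstract, the following is a minimal sketch of a feature-weighted k-nearest neighbor classifier using Euclidean distance. It is not the authors' model: the abstract does not specify the firefly-based metric or how the weights are learned, so none of that is reproduced here. The function names `weighted_distance` and `wknn_predict`, the inverse-distance voting rule, and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def weighted_distance(a, b, weights):
    """Euclidean distance with a per-feature weight (weight 0 ignores a feature)."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def wknn_predict(train_X, train_y, query, weights, k=3):
    """Predict a label by inverse-distance voting among the k nearest samples.

    Illustrative assumption: votes are weighted by 1 / distance. The paper's
    actual weighting scheme is not given in the abstract.
    """
    scored = [
        (weighted_distance(x, query, weights), label)
        for x, label in zip(train_X, train_y)
    ]
    scored.sort(key=lambda pair: pair[0])  # nearest samples first

    votes = defaultdict(float)
    for dist, label in scored[:k]:
        votes[label] += 1.0 / (dist + 1e-9)  # small constant avoids division by zero
    return max(votes, key=votes.get)


# Toy data (hypothetical): two well-separated classes in two dimensions.
train_X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = ["a", "a", "b", "b"]
prediction = wknn_predict(train_X, train_y, (0.2, 0.1), weights=(1.0, 1.0), k=3)
```

In this sketch, setting a feature's weight to zero removes it from the distance computation, which is one simple way feature weights can act as a soft form of feature selection.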

Citation

Pages: 3226-3265

Page count: 39
