Uncertainty Based Optimal Sample Selection for Big Data

Cited by: 0
Authors
Ajmal, Saadia [1]
Ashfaq, Rana Aamir Raza [1]
Saleem, Kashif [2]
Affiliations
[1] Bahauddin Zakariya Univ, Dept Comp Sci, Multan 6000, Pakistan
[2] King Saud Univ, Coll Appl Studies & Community Serv, Dept Comp Sci & Engn, Riyadh 11362, Saudi Arabia
Keywords
Uncertainty; Classification algorithms; Big Data; Prototypes; Data mining; Training; Fuzzy sets; Instance selection; Machine learning; Categorization; Sets
DOI
10.1109/ACCESS.2022.3233598
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In machine learning and pattern recognition, building a better predictive model is a key problem when the data are big or massive, especially when they contain noisy and unrepresentative samples. Such samples adversely affect the learning model and may degrade its performance. To alleviate this problem, it is sometimes necessary to sample the data by eliminating unnecessary instances while keeping the underlying distribution intact. This process is called sampling or instance selection (IS), and it involves a substantial computational cost. This paper discusses an uncertainty based optimal sample selection (UBOSS) method that can select a subset of optimal samples efficiently. The proposed method comprises three main steps. First, it uses an IS method to identify the patterns of representative and unrepresentative samples in the original data set. Second, an uncertainty-based selector obtains the fuzziness (i.e., a type of uncertainty) of those samples using a classifier whose output is a membership (fuzzy) vector. Third, a divide-and-conquer strategy is applied to obtain a subset of representative samples. Experiments are conducted on six datasets to evaluate the performance of the proposed IS method. The results show that the proposed method outperforms the baseline methods (CNN, IB3, and DROP3) in selection performance, i.e., in choosing optimum samples.
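The abstract does not state how fuzziness is computed or how the samples are divided, so the following is only a minimal sketch under stated assumptions. It assumes the widely used De Luca-Termini measure, $F(\mu) = -\frac{1}{n}\sum_{i=1}^{n}\left[\mu_i \log \mu_i + (1-\mu_i)\log(1-\mu_i)\right]$, and a three-way split into low, mid, and high fuzziness groups. The function names, the cut-off values, and the example membership vectors are hypothetical and are not taken from the paper.

```python
import math


def fuzziness(membership):
    """Assumed De Luca-Termini fuzziness of one membership vector in [0, 1]."""
    total = 0.0
    for mu in membership:
        # Entries equal to 0 or 1 contribute nothing (x * log(x) -> 0 as x -> 0).
        if 0.0 < mu < 1.0:
            total += mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)
    return -total / len(membership)


def split_by_fuzziness(memberships, low_cut, high_cut):
    """Partition sample indices into low / mid / high fuzziness groups.

    The cut-offs are hypothetical parameters chosen for illustration only.
    """
    groups = {"low": [], "mid": [], "high": []}
    for idx, vec in enumerate(memberships):
        f = fuzziness(vec)
        if f < low_cut:
            groups["low"].append(idx)
        elif f < high_cut:
            groups["mid"].append(idx)
        else:
            groups["high"].append(idx)
    return groups


# Hypothetical classifier outputs for three samples over two classes.
outputs = [[0.99, 0.01], [0.7, 0.3], [0.5, 0.5]]
groups = split_by_fuzziness(outputs, low_cut=0.2, high_cut=0.65)
print(groups)
```

A confident output such as `[0.99, 0.01]` has fuzziness near zero, while `[0.5, 0.5]` reaches the maximum of `log(2)` for this measure. Which group a selector keeps or discards is a design choice that the abstract leaves open.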
Pages: 6284-6292
Page count: 9