Sampling scheme-based classification rule mining method using decision tree in big data environment

Cited by: 22
Authors
Jin, Chenxia [1]
Li, Fachao [1]
Ma, Shijie [1]
Wang, Ying [2]
Affiliations
[1] Hebei Univ Sci & Technol, Sch Econ & Management, Shijiazhuang 050018, Hebei, Peoples R China
[2] Hebei Univ Sci & Technol, Sch Sci, Shijiazhuang 050024, Hebei, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Classification rules; Decision tree; Sampling; Reliability; Big data; ATTRIBUTE SELECTION; ALGORITHM; SYSTEM; ID3;
DOI
10.1016/j.knosys.2022.108522
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Obtaining comprehensible classification rules can be extremely important in many real applications, such as data-driven decision-making and classification tasks. Decision-tree methods are powerful and popular tools for acquiring classification rules. However, in big data scenarios they do not perform well, and the underlying data processing methods lack strong theoretical support. This study introduces sampling schemes, with and without replacement, into the implementation of decision-tree methods. The resulting method, called sampling-based classification rule mining (SCRM), is designed to improve the adaptation and generalization ability of classification rules in a big-data environment. Sampling without replacement is used to refine classification rules through the concepts of conflict and coverage rules, while sampling with replacement is used to determine rule reliability; the reliability approximation property of classification rules is proved using the law of large numbers. The effectiveness of SCRM was evaluated and verified on seven UCI datasets. Theoretical analysis and experimental results show that SCRM is generic, has good classification ability, and improves the classification accuracy of the rules. SCRM also provides theoretical and methodological support for classification rule mining on big data, and can therefore be used in many applications. © 2022 Elsevier B.V. All rights reserved.
Pages: 14