Facilitating data preprocessing by a generic framework: a proposal for clustering

Cited by: 19
Authors
Kirchner, Kathrin [1]
Zec, Jelena [2]
Delibasic, Boris [2]
Affiliations
[1] Berlin Sch Econ & Law, Alt Friedrichsfelde 60, D-10315 Berlin, Germany
[2] Univ Belgrade, Fac Org Sci, Belgrade, Serbia
Keywords
Clustering algorithm; Preprocessing in data mining; Generic framework; Preprocessing stream selection; Nonlinear dimensionality reduction; Data mining process; Knowledge
DOI
10.1007/s10462-015-9446-6
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Clustering is among the most popular data mining algorithm families. Before applying clustering algorithms to datasets, it is usually necessary to preprocess the data properly. Data preprocessing is a crucial yet still neglected step in data mining. Although preprocessing techniques and algorithms are well known, the preprocessing process is very complex and usually takes a lot of time. Instead of being handled more systematically, preprocessing is usually undervalued, i.e. more emphasis is put on choosing the appropriate clustering algorithm and setting its parameters. In our opinion, this is not because preprocessing is less important, but because it is difficult to choose the best sequence of preprocessing algorithms. We argue that it is important to better standardize this process so it is performed efficiently. Therefore, this paper proposes a generic framework for data preprocessing. It is based on a survey with data mining experts, as well as a literature and software review. The framework enables pipelining preprocessing algorithms and methods which facilitate further automated preprocessing design and the selection of a suitable preprocessing stream. The proposed framework is easily extendible, so it can be applied to other data mining algorithm families that have their own idiosyncrasies.
Pages: 271-297
Number of pages: 27