The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing

Cited by: 133
Authors
Crone, Sven F.
Lessmann, Stefan
Stahlbock, Robert
Affiliations
[1] Univ Hamburg, Inst Informat Syst, D-20146 Hamburg, Germany
[2] Univ Lancaster, Dept Management Sci, Lancaster LA1 4YX, England
Keywords
data mining; neural networks; data preprocessing; classification; marketing;
DOI
10.1016/j.ejor.2005.07.023
Chinese Library Classification (CLC) number
C93 [Management Science];
Subject classification code
12 ; 1201 ; 1202 ; 120202 ;
Abstract
Corporate data mining faces the challenge of systematic knowledge discovery in large data streams to support managerial decision making. While research in operations research, direct marketing and machine learning focuses on the analysis and design of data mining algorithms, the interaction of data mining with the preceding phase of data preprocessing has not been investigated in detail. This paper investigates the influence of different preprocessing techniques of attribute scaling, sampling, coding of categorical as well as coding of continuous attributes on the classifier performance of decision trees, neural networks and support vector machines. The impact of different preprocessing choices is assessed on a real world dataset from direct marketing using a multifactorial analysis of variance on various performance metrics and method parameterisations. Our case-based analysis provides empirical evidence that data preprocessing has a significant impact on predictive accuracy, with certain schemes proving inferior to competitive approaches. In addition, it is found that (1) selected methods prove almost as sensitive to different data representations as to method parameterisations, indicating the potential for increased performance through effective preprocessing; (2) the impact of preprocessing schemes varies by method, indicating different 'best practice' setups to facilitate superior results of a particular method; (3) algorithmic sensitivity towards preprocessing is consequently an important criterion in method evaluation and selection which needs to be considered together with traditional metrics of predictive power and computational efficiency in predictive data mining. (c) 2005 Elsevier B.V. All rights reserved.
Pages: 781-800
Number of pages: 20