A scalable decision tree system and its application in pattern recognition and intrusion detection

被引:45
作者
Li, XB [1 ]
机构
[1] Univ Massachusetts, Coll Management, Lowell, MA 01854 USA
关键词
decision trees; data mining; classification; linear discriminants; pattern recognition; intrusion detection;
D O I
10.1016/j.dss.2004.06.016
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
One of the most challenging problems in data mining is to develop scalable algorithms capable of mining massive data sets whose sizes exceed the capacity of a computer's memory. In this paper, we propose a new decision tree algorithm, named SURPASS (for Scaling Up Recursive Partitioning with Sufficient Statistics), that is highly effective in handling such large data, SURPASS incorporates linear discriminants into decision trees' recursive partitioning process. In SURPASS, the information required to build a decision tree is summarized into a set of sufficient statistics, which can be gathered incrementally from the data, by reading a subset of the data from storage space to main memory one at a time. As a result, the data size that can be handled by this algorithm is independent of memory size. We apply SURPASS to three large data sets pertaining to pattern recognition and intrusion detection problems. The results indicate that SURPASS scales up well against large data sets and produces decision tree models with very high quality. (c) 2004 Elsevier B.V. All rights reserved.
引用
收藏
页码:112 / 130
页数:19
相关论文
共 37 条
[1]  
Anderson J.P., 1980, Computer security threat monitoring and surveillance
[2]   Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables [J].
Blackard, JA ;
Dean, DJ .
COMPUTERS AND ELECTRONICS IN AGRICULTURE, 1999, 24 (03) :131-151
[3]  
BLAKE CL, 1998, UCI RESP MACHINE LEA
[4]   Bagging predictors [J].
Breiman, L .
MACHINE LEARNING, 1996, 24 (02) :123-140
[5]  
Breiman L., 1998, CLASSIFICATION REGRE
[6]  
BRODLEY CE, 1995, MACH LEARN, V19, P45, DOI 10.1007/BF00994660
[7]   AN INTRUSION-DETECTION MODEL [J].
DENNING, DE .
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 1987, 13 (02) :222-232
[8]   A decision-theoretic generalization of on-line learning and an application to boosting [J].
Freund, Y ;
Schapire, RE .
JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 1997, 55 (01) :119-139
[9]  
Gama J., 1999, Intelligent Data Analysis, V3, P1, DOI 10.1016/S1088-467X(99)00002-5
[10]  
GEHRKE J, 1998, P 24 INT C VERY LARG