Sampling method based on improved C4.5 decision tree and its application in prediction of telecom customer churn

Cited by: 0
Authors
Deng W. [1]
Deng L. [1]
Liu J. [1]
Qi J. [2]
Affiliations
[1] School of Management and Economics, Chongqing University of Posts and Telecommunications, Nan'an district, Chongqing
[2] China Telecom Co., Ltd. Hefei Branch, 255, Changjiang West Road, Hefei
Source
International Journal of Information Technology and Management | 2019, Vol. 18, No. 1
Keywords
Data mining; Decision tree; Imbalanced data; Over-sampling; Telecom customer churn; Under-sampling
DOI
10.1504/IJITM.2019.097887
Abstract
Customer churn prediction is important for telecom operators seeking to reduce churn rates and remain competitive. However, the imbalance between retained customers and churners degrades prediction accuracy. To address this problem, a new sampling method based on an improved C4.5 decision tree is proposed. First, an initial weight is set for each sample according to the data scale of each class. Then, the sample weights are adjusted through several rounds of alternative training by the improved C4.5 decision tree algorithm, whose splitting criterion considers both the gain ratio and the misclassification cost. The boundary minority examples and the centre majority examples are then identified according to their weights. Finally, over-sampling is applied to the boundary minority examples using the synthetic minority over-sampling technique (SMOTE), and under-sampling is applied to the majority examples. Experiments on UCI public data and telecom operator data show the efficiency of the new method. Copyright © 2019 Inderscience Enterprises Ltd.
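As a rough illustration of two steps named in the abstract, the sketch below assigns class-size-based initial weights and generates one SMOTE-style synthetic example. It is not the authors' implementation: the abstract does not give the weighting formula, so the inverse-class-size rule in `initial_weights` is an assumption, and the function names `initial_weights` and `smote_sample` are hypothetical. The interpolation step follows the standard SMOTE idea of placing a new point on the segment between a minority example and one of its nearest minority neighbours.

```python
import math
import random
from collections import Counter


def initial_weights(labels):
    """Assign each sample a weight based on the size of its class.

    ASSUMPTION: weight = 1 / (number_of_classes * class_size), so every
    class carries the same total weight and all weights sum to 1. The
    abstract only says weights depend on the data scale of each class.
    """
    counts = Counter(labels)
    n_classes = len(counts)
    return [1.0 / (n_classes * counts[y]) for y in labels]


def smote_sample(x, minority, k, rng):
    """Create one synthetic point from minority example x (standard SMOTE step).

    Picks one of the k nearest minority neighbours of x (Euclidean distance)
    and interpolates a random fraction of the way from x towards it.
    """
    others = [m for m in minority if m is not x]
    neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)]


if __name__ == "__main__":
    labels = [0] * 8 + [1] * 2  # 8 retained customers, 2 churners
    weights = initial_weights(labels)
    print(weights[0], weights[8])  # minority samples get the larger weight

    minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
    print(smote_sample(minority[0], minority, k=1, rng=random.Random(0)))
```

In the method described in the abstract, the weights would then be adjusted over several training rounds before selecting which minority examples to over-sample; that adjustment rule is not specified in this record and is not sketched here.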
Pages: 93-109
Page count: 16