Unsupervised dimensionality reduction versus supervised regularization for classification from sparse data

Cited by: 9
Authors
Clark, Jessica [1 ]
Provost, Foster [2 ]
Affiliations
[1] Univ Maryland, Robert H Smith Sch Business, 7621 Mowatt Lane, College Pk, MD 20742 USA
[2] NYU, Stern Sch Business, 44 West Fourth St, New York, NY 10012 USA
Keywords
Dimensionality reduction; Binary classification; Sparse data; Experimental comparison; Data mining; COMPONENT ANALYSIS; CANCER; PCA; CUSTOMERS; SELECTION; IMPROVE
DOI
10.1007/s10618-019-00616-4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Unsupervised matrix-factorization-based dimensionality reduction (DR) techniques are popularly used for feature engineering with the goal of improving the generalization performance of predictive models, especially with massive, sparse feature sets. Often DR is employed for the same purpose as supervised regularization and other forms of complexity control: exploiting a bias/variance tradeoff to mitigate overfitting. Contradicting this practice, there is consensus among existing expert guidelines that supervised regularization is a superior way to improve predictive performance. However, these guidelines are not always followed for this sort of data, and it is not unusual to find DR used with no comparison to modeling with the full feature set. Further, the existing literature does not take into account that DR and supervised regularization are often used in conjunction. We experimentally compare binary classification performance using DR features versus the original features under numerous conditions: using a total of 97 binary classification tasks, 6 classifiers, 3 DR techniques, and 4 evaluation metrics. Crucially, we also experiment using varied methodologies to tune and evaluate various key hyperparameters. We find a clear but nuanced result. Using state-of-the-art hyperparameter-selection methods, applying DR does not add value beyond supervised regularization, and can often diminish performance. However, if regularization is not done well (e.g., one simply uses the default regularization parameter), DR does perform relatively better, but these approaches result in lower performance overall. These latter results provide an explanation for why practitioners may be continuing to use DR without undertaking the necessary comparison to using the original features. However, this practice seems generally wrongheaded in light of the main results, if the goal is to maximize generalization performance.
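The comparison the abstract describes can be illustrated with a minimal sketch: fit a classifier on the original sparse features with tuned supervised regularization, and separately on matrix-factorization DR features (truncated SVD, a PCA analogue for sparse data). This is not the authors' experimental pipeline; the synthetic data, component count, and tuning grid below are assumptions for illustration only.

```python
# Hedged sketch (not the paper's pipeline): DR features vs. original sparse
# features with tuned L2 regularization, using scikit-learn.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
# Synthetic massive, sparse feature matrix (1% nonzero entries).
X = sparse_random(2000, 500, density=0.01, random_state=rng, format="csr")
w = rng.randn(500)
y = np.asarray(X @ w + 0.5 * rng.randn(2000) > 0, dtype=int).ravel()
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# (a) Original features + supervised regularization tuned by cross-validation.
grid = {"C": [0.01, 0.1, 1, 10]}
full = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=3)
full.fit(X_tr, y_tr)
auc_full = roc_auc_score(y_te, full.predict_proba(X_te)[:, 1])

# (b) Unsupervised DR first (truncated SVD on the sparse matrix), then the
#     same regularized classifier and tuning procedure on the DR features.
pipe = Pipeline([("svd", TruncatedSVD(n_components=50, random_state=0)),
                 ("clf", LogisticRegression(max_iter=1000))])
dr = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=3)
dr.fit(X_tr, y_tr)
auc_dr = roc_auc_score(y_te, dr.predict_proba(X_te)[:, 1])

print(f"AUC, original features + tuned regularization: {auc_full:.3f}")
print(f"AUC, SVD features + tuned regularization:      {auc_dr:.3f}")
```

The paper's finding corresponds to case (a) typically matching or beating case (b) once the regularization strength `C` is tuned properly, whereas a fixed default `C` can make the DR variant look relatively better.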
Pages: 871-916 (46 pages)