Prediction-Based Structured Variable Selection through the Receiver Operating Characteristic Curves

Cited by: 15
Authors
Wang, Yuanjia [1 ]
Chen, Huaihou [1 ]
Li, Runze [2 ,3 ]
Duan, Naihua
Lewis-Fernandez, Roberto [4 ]
Affiliations
[1] Columbia Univ, Mailman Sch Publ Hlth, Dept Biostat, New York, NY 10032 USA
[2] Penn State Univ, Dept Stat, University Pk, PA 16802 USA
[3] Penn State Univ, Methodol Ctr, University Pk, PA 16802 USA
[4] Columbia Univ, Dept Psychiat, New York State Psychiat Inst, New York, NY 10032 USA
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
Area under the curve; Disease screening; Hierarchical variable selection; ROC curve; Support vector machine; NONCONCAVE PENALIZED LIKELIHOOD; CLASSIFICATION; REGRESSION; ACCURACY; MODEL;
DOI
10.1111/j.1541-0420.2010.01533.x
Chinese Library Classification number
Q [Biological sciences]
Subject classification codes
07; 0710; 09
Abstract
In many clinical settings, a commonly encountered problem is to assess the accuracy of a screening test for early detection of a disease. In these applications, the predictive performance of the test is of interest. Variable selection may be useful in designing a medical test. An example is a research study conducted to design a new screening test by selecting variables from an existing screener with a hierarchical structure among variables: there are several root questions, each followed by its stem questions. The stem questions are asked only after a subject has answered the root question. It is therefore unreasonable to select a model that contains stem variables but not their root variable. In this work, we propose methods to perform variable selection with structured variables when the predictive accuracy of a diagnostic test is the main concern of the analysis. We take a linear combination of individual variables to form a combined test. We then maximize a direct summary measure of the predictive performance of the test, the area under a receiver operating characteristic curve (AUC of an ROC), subject to a penalty function to control for overfitting. Since maximizing the empirical AUC of the ROC of a combined test is a complicated nonconvex problem (Pepe, Cai, and Longton, 2006, Biometrics 62, 221-229), we explore the connection between the empirical AUC and a support vector machine (SVM). We cast the problem of maximizing the predictive performance of a combined test as a penalized SVM problem and apply a reparametrization to impose the hierarchical structure among variables. We also describe a penalized logistic regression variable selection procedure for structured variables and compare it with the ROC-based approaches. We use simulation studies based on real data to examine the performance of the proposed methods. Finally, we apply the developed methods to design a structured screener to be used in primary care clinics to refer potentially psychotic patients for further specialty diagnostics and treatment.
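The quantities named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the pairwise hinge surrogate, and the product-form reparametrization (stem coefficient = root coefficient × stem multiplier) are assumptions chosen for illustration, since the abstract does not give the exact formulation.

```python
import numpy as np


def empirical_auc(scores_pos, scores_neg):
    """Empirical AUC: share of (case, control) pairs in which the case
    scores higher than the control; ties count as one half."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def pairwise_hinge(beta, x_pos, x_neg):
    """Assumed convex surrogate for 1 - AUC: mean hinge loss on the
    pairwise differences of the combined test x @ beta."""
    diff = (x_pos @ beta)[:, None] - (x_neg @ beta)[None, :]
    return float(np.mean(np.maximum(0.0, 1.0 - diff)))


def hierarchical_coefs(gamma_root, theta_stem):
    """Hypothetical reparametrization: each stem coefficient is the root
    coefficient times a stem multiplier, so a root shrunk to zero
    removes all of its stems from the model."""
    return gamma_root, gamma_root * theta_stem


# Toy scores of a combined test for three cases and two controls.
cases = np.array([0.9, 0.8, 0.4])
controls = np.array([0.5, 0.3])
print(empirical_auc(cases, controls))  # 5 of 6 pairs are ranked correctly
```

The indicator inside the empirical AUC is what makes direct maximization nonconvex; replacing it with a hinge loss on pairwise differences is one way to see the link to an SVM-type objective, to which a sparsity penalty can then be added.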
Pages: 896-905
Number of pages: 10