Optimal ratio for data splitting

Cited by: 360
Authors
Joseph, V. Roshan [1]
Affiliations
[1] Georgia Inst Technol, H Milton Stewart Sch Ind & Syst Engn, Atlanta, GA 30332 USA
Funding
National Science Foundation (USA);
Keywords
testing; training; validation; CALIBRATION; VALIDATION; MODELS;
DOI
10.1002/sam.11583
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article, we show that the optimal training/testing splitting ratio is √p : 1, where p is the number of parameters in a linear regression model that explains the data well.
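As a rough illustration of the ratio stated in the abstract, the sketch below turns √p : 1 into training and testing set sizes. The function name `optimal_split`, its arguments, and the rounding and clamping choices are assumptions made for this example and are not taken from the article; the abstract also does not say how p should be chosen in practice, so here it is simply passed in.

```python
import math


def optimal_split(n_samples: int, n_params: int) -> tuple:
    """Return (n_train, n_test) for a sqrt(p) : 1 training/testing ratio.

    Hypothetical helper: n_params plays the role of p, the number of
    parameters in a linear regression model that explains the data well.
    """
    if n_samples < 2:
        raise ValueError("need at least two samples to split")
    if n_params < 1:
        raise ValueError("n_params must be at least 1")

    root_p = math.sqrt(n_params)
    # A ratio of sqrt(p) : 1 means a training fraction of sqrt(p) / (sqrt(p) + 1).
    train_fraction = root_p / (root_p + 1.0)

    # Rounding to the nearest integer is an assumption of this sketch.
    n_train = round(n_samples * train_fraction)
    # Keep at least one observation on each side of the split.
    n_train = min(max(n_train, 1), n_samples - 1)
    return n_train, n_samples - n_train


if __name__ == "__main__":
    # With p = 9 the ratio is 3 : 1, so 75% of the data goes to training.
    print(optimal_split(1000, 9))
```

Under this reading, p = 1 gives an even split, and the training share grows slowly as p increases (p = 16 gives 4 : 1, or 80% training).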
Pages: 531-538
Page count: 8