Autoencoders for sample size estimation for fully connected neural network classifiers

Cited by: 4
Authors
Gulamali, Faris F. [1]
Sawant, Ashwin S. [1]
Kovatch, Patricia [1]
Glicksberg, Benjamin [1]
Charney, Alexander [1]
Nadkarni, Girish N. [1]
Oermann, Eric [2 ]
Affiliations
[1] Icahn Sch Med, New York, NY 10029 USA
[2] NYU, New York, NY 10016 USA
Funding
US National Institutes of Health
Keywords
POWER
DOI
10.1038/s41746-022-00728-0
Chinese Library Classification
R19 [Health care organization and services (health services administration)]
Abstract
Sample size estimation is a crucial step in experimental design but is understudied in the context of deep learning. Currently, estimating the quantity of labeled data needed to train a classifier to a desired performance is largely based on prior experience with similar models and problems, or on untested heuristics. In many supervised machine learning applications, data labeling can be expensive and time-consuming and would benefit from a more rigorous means of estimating labeling requirements. Here, we study the problem of estimating the minimum sample size of labeled training data necessary for training computer vision models, as an exemplar for other deep learning problems. We consider the problem of identifying the minimal number of labeled data points needed to achieve a generalizable representation of the data, which we term the minimum converging sample (MCS). We use autoencoder loss to estimate the MCS for fully connected neural network classifiers. At sample sizes smaller than the MCS estimate, fully connected networks fail to distinguish classes, and at sample sizes above the MCS estimate, generalizability correlates strongly with the autoencoder's loss. We provide an easily accessible, code-free, and dataset-agnostic tool to estimate sample sizes for fully connected networks. Taken together, our findings suggest that MCS and convergence estimation are promising methods to guide sample size estimates for data collection and labeling prior to training deep learning models in computer vision.
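To make the idea concrete, the sketch below illustrates one plausible reading of the abstract: train a small autoencoder on nested subsets of increasing size and take the smallest size at which reconstruction loss plateaus as the MCS estimate. This is not the authors' released (code-free) tool; the network architecture, subset schedule, convergence threshold (rel_tol), and the helper names estimate_mcs and train_autoencoder are illustrative assumptions, not details from the paper.

# Minimal sketch, assuming PyTorch; all specifics are assumptions (see note above).
import torch
import torch.nn as nn

def train_autoencoder(x, epochs=200, lr=1e-3):
    """Train a small fully connected autoencoder on x; return final MSE loss."""
    d = x.shape[1]
    model = nn.Sequential(
        nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 16),   # encoder (assumed sizes)
        nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, d),   # decoder
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), x)  # reconstruction loss
        loss.backward()
        opt.step()
    return loss.item()

def estimate_mcs(data, sizes, rel_tol=0.05):
    """Return (mcs, losses): the smallest tested sample size at which the
    autoencoder loss plateaus, judged by an assumed relative-change
    threshold rel_tol, plus the losses observed at each size."""
    losses = []
    for n in sizes:
        losses.append(train_autoencoder(data[:n]))
        if len(losses) >= 2:
            prev, curr = losses[-2], losses[-1]
            if abs(prev - curr) / max(prev, 1e-12) < rel_tol:
                return n, losses  # loss has stopped improving: MCS estimate
    return None, losses  # no plateau within the tested sizes

if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(5000, 32)  # stand-in for flattened image features
    mcs, losses = estimate_mcs(x, sizes=[250, 500, 1000, 2000, 4000])
    print("MCS estimate:", mcs, "losses:", [round(l, 4) for l in losses])

In practice one would replace the random tensor with the actual unlabeled feature matrix and sweep sizes on a log scale; the key design point implied by the abstract is that the convergence signal comes from unsupervised reconstruction loss, so no labels are needed to produce the estimate.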
Pages: 8