Sample size determination for biomedical big data with limited labels

Cited by: 8

Authors:

Richter, Aaron N. [1]

Khoshgoftaar, Taghi M. [1]

Affiliations:

[1] Florida Atlantic Univ, Dept Comp & Elect Engn & Comp Sci, 777 Glades Rd, Boca Raton, FL 33431 USA

Source:

Network Modeling and Analysis in Health Informatics and Bioinformatics | 2020 / Vol. 9 / Issue 01

Keywords:

Sample size determination; Big data; Limited labels; Learning curve; Class imbalance;

DOI:

10.1007/s13721-020-0218-0

Chinese Library Classification (CLC) number:

Q [Biological Sciences]

Discipline classification codes:

07; 0710; 09

Abstract:

The era of big data has produced vast amounts of information that can be used to build machine learning models. In many cases, however, there is a point where adding more data only marginally increases model performance. This is especially important for scenarios of limited labeled data, as annotation can be expensive and time-consuming. If the required sample size for accurate model performance can be determined early, then resources can be allocated appropriately to minimize time and cost. In this study, we explore sample size determination methods for four real-world biomedical datasets, spanning genomics, proteomics, electronic health records, and insurance claims data, each with millions of instances and a class ratio below 2%. The methods approximate the learning curve for a large amount of data using only a small amount of data. We evaluate an existing method that fits an inverse power law model to a small learning curve and introduce a novel semi-supervised method that uses the large amount of unlabeled data to estimate a learning curve. We find that the inverse power law method is applicable to big data, while the semi-supervised method can be better at detecting convergence. To the best of our knowledge, this is the first study to apply an inverse power law curve fitting method to big data with limited labels and compare it to a semi-supervised approach.
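To make the curve-fitting idea concrete, the following is a minimal sketch of fitting an inverse power law to a small learning curve and extrapolating it. It is not the authors' implementation: the functional form score(n) = a - b * n^(-c), the grid search over c, the helper names, and the synthetic noise-free data points are all assumptions chosen for illustration.

```python
import math


def fit_inverse_power_law(sizes, scores, c_grid=None):
    """Fit score(n) = a - b * n**(-c) to observed learning-curve points.

    Assumed approach (not from the paper): for each candidate exponent c,
    the model is linear in n**(-c), so a and b follow from ordinary least
    squares in closed form; the c with the lowest squared error is kept.
    """
    if c_grid is None:
        c_grid = [i / 100 for i in range(1, 201)]  # candidates 0.01 .. 2.00
    count = len(sizes)
    y_mean = sum(scores) / count
    best = None
    for c in c_grid:
        z = [s ** (-c) for s in sizes]
        z_mean = sum(z) / count
        szz = sum((zi - z_mean) ** 2 for zi in z)
        szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, scores))
        slope = szy / szz           # slope of score against n**(-c), equals -b
        a = y_mean - slope * z_mean  # asymptotic score as n grows
        b = -slope
        sse = sum((a - b * zi - yi) ** 2 for zi, yi in zip(z, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def predict(n, a, b, c):
    """Extrapolated score at training-set size n."""
    return a - b * n ** (-c)


def size_for_target(target, a, b, c):
    """Smallest n whose predicted score reaches target, or None if unreachable."""
    if b <= 0 or target >= a:
        return None
    return math.ceil((b / (a - target)) ** (1 / c))


# Synthetic, noise-free learning curve (illustrative values, not study data).
sizes = [100, 200, 400, 800, 1600, 3200]
scores = [0.90 - 1.5 * n ** -0.5 for n in sizes]

a, b, c = fit_inverse_power_law(sizes, scores)
print(f"a={a:.4f} b={b:.4f} c={c:.2f}")
print(f"predicted score at 1,000,000 instances: {predict(1_000_000, a, b, c):.4f}")
print(f"instances needed for a score of 0.89: {size_for_target(0.89, a, b, c)}")
```

With real data the small curve would be measured on noisy, repeated subsamples, so the fitted parameters carry uncertainty; a grid search is used here only to keep the sketch dependency-free, and a nonlinear least-squares routine would be a natural substitute.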

Pages: 13
