Sample size determination for biomedical big data with limited labels

Cited by: 8

Authors:

Richter, Aaron N. [1]

Khoshgoftaar, Taghi M. [1]

Affiliations:

[1] Florida Atlantic Univ, Dept Comp & Elect Engn & Comp Sci, 777 Glades Rd, Boca Raton, FL 33431 USA

Source:

NETWORK MODELING AND ANALYSIS IN HEALTH INFORMATICS AND BIOINFORMATICS | 2020 / Vol. 9 / Issue 01

Keywords:

Sample size determination; Big data; Limited labels; Learning curve; Class imbalance;

DOI:

10.1007/s13721-020-0218-0

Chinese Library Classification (CLC) number:

Q [Biological Sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

The era of big data has produced vast amounts of information that can be used to build machine learning models. In many cases, however, there is a point where adding more data only marginally increases model performance. This is especially important for scenarios of limited labeled data, as annotation can be expensive and time-consuming. If the required sample size for accurate model performance can be determined early, then resources can be allocated appropriately to minimize time and cost. In this study, we explore sample size determination methods for four real-world biomedical datasets, spanning genomics, proteomics, electronic health records, and insurance claims data, each with millions of instances and a class ratio below 2%. The methods approximate the learning curve for a large amount of data using only a small amount of data. We evaluate an existing method that fits an inverse power law model to a small learning curve and introduce a novel semi-supervised method that uses the large amount of unlabeled data to estimate a learning curve. We find that the inverse power law method is applicable to big data, while the semi-supervised method can be better at detecting convergence. To the best of our knowledge, this is the first study to apply an inverse power law curve fitting method to big data with limited labels and compare it to a semi-supervised approach.
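The inverse power law idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the functional form `score(n) = a - b * n**(-c)`, the grid-search fitting procedure, the function names, and the synthetic learning-curve points are all assumptions chosen for illustration. The sketch fits the curve to a few small training-set sizes and then extrapolates to a much larger size.

```python
def fit_inverse_power_law(sizes, scores, c_grid=None):
    """Fit score(n) ~ a - b * n**(-c) (assumed form, for illustration).

    For each candidate decay rate c on a grid, the model is linear in
    (a, b), so a and b are found by ordinary least squares; the c with
    the smallest squared error wins.
    """
    if c_grid is None:
        c_grid = [i / 100 for i in range(1, 201)]  # c in 0.01 .. 2.00
    m = len(sizes)
    y_mean = sum(scores) / m
    best = None
    for c in c_grid:
        z = [n ** (-c) for n in sizes]  # transformed predictor
        z_mean = sum(z) / m
        szz = sum((zi - z_mean) ** 2 for zi in z)
        if szz == 0:
            continue
        slope = sum((zi - z_mean) * (yi - y_mean)
                    for zi, yi in zip(z, scores)) / szz
        a = y_mean - slope * z_mean  # asymptotic performance
        b = -slope                   # scale of the performance gap
        sse = sum((a - b * zi - yi) ** 2 for zi, yi in zip(z, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def predict(n, a, b, c):
    """Predicted score at training-set size n."""
    return a - b * n ** (-c)


# Synthetic learning curve (made-up values, not results from the paper).
true_a, true_b, true_c = 0.90, 2.0, 0.5
small_sizes = [100, 200, 400, 800, 1600]
small_scores = [predict(n, true_a, true_b, true_c) for n in small_sizes]

a, b, c = fit_inverse_power_law(small_sizes, small_scores)
extrapolated = predict(1_000_000, a, b, c)
print(f"a={a:.3f}, b={b:.3f}, c={c:.2f}, score at 1e6 ~ {extrapolated:.4f}")
```

In this form `a` is the performance the curve converges to, so comparing `predict(n, a, b, c)` against `a` gives a rough sense of how much additional labeled data might still help. Real learning curves are noisy, so in practice a weighted or nonlinear least-squares fit over repeated runs would likely be preferable to this noise-free toy.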

Pages: 13
