Sample size determination for biomedical big data with limited labels

Cited by: 8
Authors
Richter, Aaron N. [1 ]
Khoshgoftaar, Taghi M. [1 ]
Affiliations
[1] Florida Atlantic Univ, Dept Comp & Elect Engn & Comp Sci, 777 Glades Rd, Boca Raton, FL 33431 USA
Source
NETWORK MODELING AND ANALYSIS IN HEALTH INFORMATICS AND BIOINFORMATICS | 2020 / Vol. 9 / Issue 01
Keywords
Sample size determination; Big data; Limited labels; Learning curve; Class imbalance;
DOI
10.1007/s13721-020-0218-0
Chinese Library Classification (CLC) number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
The era of big data has produced vast amounts of information that can be used to build machine learning models. In many cases, however, there is a point beyond which adding more data only marginally increases model performance. This is especially important when labeled data are limited, as annotation can be expensive and time-consuming. If the sample size required for accurate model performance can be determined early, resources can be allocated appropriately to minimize time and cost. In this study, we explore sample size determination methods for four real-world biomedical datasets, spanning genomics, proteomics, electronic health records, and insurance claims data, each with millions of instances and a class ratio of less than 2%. The methods approximate a learning curve for a large amount of data using a small amount of data. We evaluate an existing method that fits an inverse power law model to a small learning curve, and we introduce a novel semi-supervised method that uses the large amount of unlabeled data to estimate a learning curve. We find that the inverse power law method is applicable to big data, while the semi-supervised method can be better at detecting convergence. To the best of our knowledge, this is the first study to apply an inverse power law curve fitting method to big data with limited labels and compare it to a semi-supervised approach.
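The inverse power law idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the functional form `score(n) = a - b * n**(-c)`, the grid search over `c`, the function names, the synthetic learning curve, and the 0.01 tolerance are all assumptions chosen for illustration. For a fixed exponent `c` the model is linear in `a` and `b`, so those two parameters are obtained by ordinary least squares.

```python
import math


def fit_inverse_power_law(sizes, scores, c_grid=None):
    """Fit score(n) = a - b * n**(-c) (assumed form) to a small learning curve.

    The exponent c is chosen by grid search; for each candidate c the
    parameters a and b come from closed-form ordinary least squares.
    """
    if c_grid is None:
        c_grid = [k / 100 for k in range(1, 201)]  # candidate c in (0, 2]
    m = len(sizes)
    mean_y = sum(scores) / m
    best = None
    for c in c_grid:
        x = [n ** (-c) for n in sizes]
        mean_x = sum(x) / m
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, scores))
        slope = sxy / sxx            # slope of y on x equals -b
        a = mean_y - slope * mean_x  # intercept is the asymptote a
        b = -slope
        sse = sum((a - b * xi - yi) ** 2 for xi, yi in zip(x, scores))
        if best is None or sse < best[3]:
            best = (a, b, c, sse)
    return best[:3]


def predict(n, a, b, c):
    """Predicted score at training-set size n."""
    return a - b * n ** (-c)


def required_size(a, b, c, tol):
    """Smallest n whose predicted score is within tol of the asymptote a."""
    if b <= 0 or c <= 0:
        raise ValueError("curve does not increase toward an asymptote")
    return math.ceil((b / tol) ** (1 / c))


# Synthetic, noise-free learning curve (illustrative values only).
sizes = [100, 200, 400, 800, 1600, 3200]
scores = [predict(n, 0.85, 1.5, 0.5) for n in sizes]

a, b, c = fit_inverse_power_law(sizes, scores)
n_req = required_size(a, b, c, tol=0.01)
print(f"a={a:.3f} b={b:.3f} c={c:.2f} required n ~ {n_req}")
```

On real data the scores would be noisy, so each training-set size would typically be evaluated over several resamples before fitting, and the extrapolated sample size should be read as a rough estimate rather than a guarantee.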
Pages: 13