Sample size determination for biomedical big data with limited labels

Cited by: 8
Authors
Richter, Aaron N. [1]
Khoshgoftaar, Taghi M. [1]
Affiliations
[1] Florida Atlantic Univ, Dept Comp & Elect Engn & Comp Sci, 777 Glades Rd, Boca Raton, FL 33431 USA
Source
NETWORK MODELING AND ANALYSIS IN HEALTH INFORMATICS AND BIOINFORMATICS | 2020, Vol. 9, Issue 01
Keywords
Sample size determination; Big data; Limited labels; Learning curve; Class imbalance;
DOI
10.1007/s13721-020-0218-0
Chinese Library Classification number
Q [Biological Sciences];
Subject classification code
07 ; 0710 ; 09 ;
Abstract
The era of big data has produced vast amounts of information that can be used to build machine learning models. In many cases, however, there is a point beyond which adding more data only marginally increases model performance. This is especially important when labeled data are limited, as annotation can be expensive and time-consuming. If the sample size required for accurate model performance can be determined early, resources can be allocated appropriately to minimize time and cost. In this study, we explore sample size determination methods for four real-world biomedical datasets, spanning genomics, proteomics, electronic health records, and insurance claims data, each with millions of instances and a class ratio below 2%. The methods approximate the learning curve for a large amount of data using only a small amount of data. We evaluate an existing method that fits an inverse power law model to a small learning curve, and we introduce a novel semi-supervised method that uses the large amount of unlabeled data to estimate a learning curve. We find that the inverse power law method is applicable to big data, while the semi-supervised method can be better at detecting convergence. To the best of our knowledge, this is the first study to apply an inverse power law curve fitting method to big data with limited labels and compare it to a semi-supervised approach.
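The inverse power law idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the functional form, so the code below assumes the commonly used shape y(n) = a - b * n^(-c), where n is the number of labeled instances and y is a performance score. The function names (`fit_inverse_power_law`, `predict`, `smallest_size_within`), the grid over c, and the synthetic learning curve are all hypothetical and are not taken from the paper.

```python
import math

# Illustrative sketch only. Assumed model: y(n) = a - b * n**(-c).
# The functional form, grid range, and data are assumptions, not the paper's.


def fit_inverse_power_law(sizes, scores, c_grid=None):
    """Return (a, b, c) minimizing the squared error of a - b * n**(-c)."""
    if c_grid is None:
        c_grid = [i / 100 for i in range(1, 201)]  # c in 0.01 .. 2.00
    best = None
    for c in c_grid:
        # For a fixed c the model is linear in x = n**(-c): y = a - b * x,
        # so a and b follow from ordinary least squares in closed form.
        xs = [n ** (-c) for n in sizes]
        m = len(xs)
        mean_x = sum(xs) / m
        mean_y = sum(scores) / m
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
        slope = sxy / sxx
        a = mean_y - slope * mean_x
        b = -slope
        sse = sum((a - b * x - y) ** 2 for x, y in zip(xs, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]


def predict(n, a, b, c):
    """Extrapolated score at sample size n under the fitted curve."""
    return a - b * n ** (-c)


def smallest_size_within(a, b, c, tolerance):
    """Smallest n whose predicted score is within `tolerance` of the plateau a.

    Solves b * n**(-c) <= tolerance for n; requires b > 0 and tolerance > 0.
    """
    if b <= 0 or tolerance <= 0:
        raise ValueError("b and tolerance must be positive")
    return math.ceil((b / tolerance) ** (1.0 / c))


# Synthetic small learning curve generated from a = 0.90, b = 0.80, c = 0.50.
sizes = [100, 200, 400, 800, 1600]
scores = [0.90 - 0.80 * n ** (-0.5) for n in sizes]

a, b, c = fit_inverse_power_law(sizes, scores)
needed = smallest_size_within(a, b, c, tolerance=0.01)
print(f"a={a:.3f} b={b:.3f} c={c:.2f} needed_n={needed}")
```

The grid search over c keeps the sketch dependency-free; a nonlinear least-squares routine would be a natural replacement when real, noisy learning curves are fitted.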
Pages: 13