A supervised machine learning workflow for the reduction of highly dimensional biological data

Cited by: 2
Authors
Andersen, Linnea K. [1 ]
Reading, Benjamin J. [1 ,2 ]
Affiliations
[1] North Carolina State Univ, Dept Appl Ecol Raleigh, Raleigh, NC USA
[2] North Carolina State Univ, Pamlico Aquaculture Field Lab, Aurora, NC USA
Source
ARTIFICIAL INTELLIGENCE IN THE LIFE SCIENCES | 2024 / Volume 5
Funding
U.S. Department of Agriculture; U.S. National Oceanic and Atmospheric Administration
Keywords
Machine learning; Supervised learning; Data analysis; Dimensionality reduction; Biological data; Genomics; DYNAMICS;
DOI
10.1016/j.ailsci.2023.100090
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 ; 081704 ;
Abstract
Recent technological advancements have revolutionized research capabilities across the biological sciences by enabling the collection of large data that provides a broader picture of systems from the cellular to ecosystem level at a more refined resolution. The rapid rate of generating these data has exacerbated bottlenecks in study design and data analysis approaches, especially as conventional methods that incorporate traditional statistical tests and assumptions are not suitable or sufficient for highly dimensional data (i.e., more than 1,000 variables). The application of machine learning techniques in large data analysis is one promising solution that is increasingly popular. However, limitations in expertise such that the results from machine learning models can be interpreted to gain meaningful biological insight pose a great challenge. To address this challenge, a user-friendly machine learning workflow that can be applied to a wide variety of data types to reduce these large data to those variables (attributes) most determinant of experimental and/or observed conditions is provided, as well as a general overview of data analysis and machine learning approaches and considerations thereof. The workflow presented here has been beta-tested with great success and is recommended to be incorporated into analysis pipelines of large data as a standardized approach to reduce data dimensionality. Moreover, the workflow is flexible, and the underlying concepts and steps can be modified to best suit user needs, objectives, and study parameters.
Pages: 11