A supervised machine learning workflow for the reduction of highly dimensional biological data

Cited by: 2

Authors:

Andersen, Linnea K. ^{[1]}

Reading, Benjamin J. ^{[1,2]}

Affiliations:

[1] North Carolina State Univ, Dept Appl Ecol Raleigh, Raleigh, NC USA

[2] North Carolina State Univ, Pamlico Aquaculture Field Lab, Aurora, NC USA

Source:

Artificial Intelligence in the Life Sciences | 2024, Vol. 5

Funding:

U.S. Department of Agriculture; National Oceanic and Atmospheric Administration

Keywords:

Machine learning; Supervised learning; Data analysis; Dimensionality reduction; Biological data; Genomics; Dynamics

DOI:

10.1016/j.ailsci.2023.100090

Chinese Library Classification (CLC):

Q5 [Biochemistry]; Q7 [Molecular Biology]

Discipline classification codes:

071010; 081704

Abstract:

Recent technological advancements have revolutionized research capabilities across the biological sciences by enabling the collection of large data that provides a broader picture of systems from the cellular to ecosystem level at a more refined resolution. The rapid rate of generating these data has exacerbated bottlenecks in study design and data analysis approaches, especially as conventional methods that incorporate traditional statistical tests and assumptions are not suitable or sufficient for highly dimensional data (i.e., more than 1,000 variables). The application of machine learning techniques in large data analysis is one promising solution that is increasingly popular. However, limitations in expertise such that the results from machine learning models can be interpreted to gain meaningful biological insight pose a great challenge. To address this challenge, a user-friendly machine learning workflow that can be applied to a wide variety of data types to reduce these large data to those variables (attributes) most determinant of experimental and/or observed conditions is provided, as well as a general overview of data analysis and machine learning approaches and considerations thereof. The workflow presented here has been beta-tested with great success and is recommended to be incorporated into analysis pipelines of large data as a standardized approach to reduce data dimensionality. Moreover, the workflow is flexible, and the underlying concepts and steps can be modified to best suit user needs, objectives, and study parameters.
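
The abstract does not name the algorithm or software behind the workflow, so the following is only an illustrative sketch of the general idea of supervised dimensionality reduction: score each attribute by how well it separates the labeled conditions, then keep the highest-scoring ones. Information gain is assumed here as the scoring rule, and the attribute names (`gene_a`, `gene_b`, `gene_c`), the toy data, and the helper functions are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Drop in label entropy after grouping samples by a discrete attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def top_attributes(table, labels, k):
    """Rank attributes by information gain and return the k best names."""
    scores = {name: information_gain(col, labels) for name, col in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: three discretized attributes measured on six samples.
table = {
    "gene_a": ["hi", "hi", "hi", "lo", "lo", "lo"],  # tracks the condition exactly
    "gene_b": ["hi", "lo", "hi", "lo", "hi", "lo"],  # only weakly informative
    "gene_c": ["hi", "hi", "hi", "hi", "hi", "hi"],  # constant, uninformative
}
labels = ["treated", "treated", "treated", "control", "control", "control"]

selected = top_attributes(table, labels, k=1)
```

In this toy case `selected` contains only `gene_a`, the attribute that fully determines the condition. Real data with more than 1,000 continuous variables would first need discretization or a scoring rule suited to continuous values, and the ranking should be validated on held-out samples.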


Pages: 11

References: 67
