BioM2: biologically informed multi-stage machine learning for phenotype prediction using omics data

Authors
Zhang, Shunjie [1]
Li, Pan [2]
Wang, Shenghan [2]
Zhu, Jijun [2]
Huang, Zhongting [2]
Cai, Fuqiang
Freidel, Sebastian [4]
Ling, Fei [1, 2]
Schwarz, Emanuel [3, 4]
Chen, Junfang [2, 5]
Affiliations
[1] South China Univ Technol, Sch Biol & Biol Engn, Guangzhou, Peoples R China
[2] Fudan Univ, Greater Bay Area Inst Precis Med Guangzhou, Ctr Intelligent Med, Sch Life Sci, 6,2nd Nanjiang Rd, Guangzhou 511462, Peoples R China
[3] Heidelberg Univ, Hector Inst Artificial Intelligence Psychiat, Med Fac Mannheim, Cent Inst Mental Hlth, M7, D-68161 Mannheim, Germany
[4] Heidelberg Univ, Cent Inst Mental Hlth, Med Fac, Dept Psychiat & Psychotherapy, J5, D-68159 Mannheim, Germany
[5] Fudan Univ, Ctr Evolutionary Biol, Sch Life Sci, Shanghai, Peoples R China
Keywords
BioM2; machine learning; phenotype prediction; DNA methylome; transcriptome; Gene Ontology; EXPRESSION; BRAIN; PATHWAY
DOI
10.1093/bib/bbae384
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Navigating the complex landscape of high-dimensional omics data with machine learning models presents a significant challenge. Integrating biological domain knowledge into these models has shown promise for creating more meaningful stratifications of predictor variables, leading to algorithms that are both more accurate and more generalizable. However, machine learning tools capable of incorporating such biological knowledge are still not widely available. To address this gap, we introduce BioM2, a novel R package for biologically informed multistage machine learning. BioM2 uses biological information to stratify and aggregate high-dimensional biological data in the context of machine learning. Applied to genome-wide DNA methylation and transcriptome-wide gene expression data, BioM2 has been shown to enhance predictive performance, surpassing traditional machine learning models that operate without the integration of biological knowledge. A key feature of BioM2 is its ability to rank predictor variables within biological categories, specifically Gene Ontology pathways. This functionality not only aids the interpretability of the results but also enables a subsequent modular network analysis of these variables, shedding light on the systems-level biology underpinning the predictive outcome. In summary, BioM2 is a biologically informed multistage machine learning framework for phenotype prediction based on omics data, and it is available as the BioM2 package on CRAN (https://cran.r-project.org/web/packages/BioM2/index.html).
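The stratify-then-aggregate idea described in the abstract can be sketched in a few lines. This is a minimal illustration in Python and not the BioM2 R package's implementation or API: the nearest-centroid learner, all function names, the toy data, and the two pathway names are assumptions made for the example. Stage 1 fits one small model per pathway on that pathway's features and produces out-of-fold scores; stage 2 fits a model on the resulting pathway-level matrix.

```python
import random

def fit_centroids(X, y, cols):
    """Return the class-0 and class-1 mean vectors restricted to `cols`."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]
    return cents

def score(cents, x, cols):
    """Positive when x is closer to the class-1 centroid than to class 0."""
    d = {k: sum((x[c] - m) ** 2 for c, m in zip(cols, v))
         for k, v in cents.items()}
    return d[0] - d[1]

def out_of_fold_scores(X, y, cols, k=5):
    """Score every sample with a model that was not trained on it."""
    scores = [0.0] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        cents = fit_centroids([X[i] for i in train], [y[i] for i in train], cols)
        for i in test:
            scores[i] = score(cents, X[i], cols)
    return scores

def accuracy(scores, y):
    return sum((s > 0) == (t == 1) for s, t in zip(scores, y)) / len(y)

def two_stage(X, y, pathways, k=5):
    names = list(pathways)
    # Stage 1: one out-of-fold score per pathway (feature stratification).
    stage1 = {n: out_of_fold_scores(X, y, pathways[n], k) for n in names}
    # Rank pathways by how well their own score predicts the phenotype.
    ranking = sorted(names, key=lambda n: accuracy(stage1[n], y), reverse=True)
    # Stage 2: learn on the pathway-level matrix (aggregation).
    Z = [[stage1[n][i] for n in names] for i in range(len(X))]
    final = out_of_fold_scores(Z, y, list(range(len(names))), k)
    return ranking, accuracy(final, y)

def make_toy_data(n=120, seed=0):
    """Hypothetical data: features 0-2 carry signal, features 3-5 are noise."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(2.0 * t, 1.0) for _ in range(3)] +
         [rng.gauss(0.0, 1.0) for _ in range(3)] for t in y]
    return X, y

# Hypothetical pathway-to-feature mapping; real use would derive it from annotation.
PATHWAYS = {"signal_pathway": [0, 1, 2], "noise_pathway": [3, 4, 5]}

X, y = make_toy_data()
ranking, acc = two_stage(X, y, PATHWAYS)
print("pathway ranking:", ranking)
print("stage-2 accuracy:", round(acc, 3))
```

Because stage 2 reuses the same fold split as stage 1, the reported accuracy is slightly optimistic. A careful evaluation would nest the cross-validation loops or score a separate held-out test set.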
Pages: 13