Integration of multi-omics data for prediction of phenotypic traits using random forest

Cited by: 70

Authors:

Acharjee, Animesh ^{[1,3]}

Kloosterman, Bjorn ^{[1,2]}

Visser, Richard G. F. ^{[1]}

Maliepaard, Chris ^{[1]}

Affiliations:

[1] Univ Wageningen & Res Ctr, Wageningen UR Plant Breeding, NL-6700 AJ Wageningen, Netherlands

[2] Keygene NV, POB 216, NL-6700 AE Wageningen, Netherlands

[3] MRC Human Nutr Res, 120 Fulbourn Rd, Cambridge CB1 9NL, England

Source:

BMC BIOINFORMATICS | 2016 / Volume 17

Keywords:

Data integration; Genetical genomics; Networks; Random forest; GENETIC GENOMICS; POTATO; EXPRESSION; QTL; RNA;

DOI:

10.1186/s12859-016-1043-4

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Background: In order to find genetic and metabolic pathways related to phenotypic traits of interest, we analyzed gene expression data, metabolite data obtained with GC-MS and LC-MS, proteomics data and a selected set of tuber quality phenotypic data from a diploid segregating mapping population of potato. In this study we present an approach to integrate these omics data sets for the purpose of predicting phenotypic traits. This gives us networks of relatively small sets of interrelated omics variables that can predict, with higher accuracy, a quality trait of interest.
Results: We used Random Forest regression to integrate multiple omics data sets for prediction of four quality traits of potato: tuber flesh colour, DSC onset, tuber shape and enzymatic discoloration. For tuber flesh colour, beta-carotene hydroxylase and zeaxanthin epoxidase were ranked first and forty-fourth, respectively; both have previously been associated with flesh colour in potato tubers. Combining all the significant genes, LC-peaks, GC-peaks and proteins, the variation explained was 75 %, only slightly more than what gene expression or LC-MS data explain by themselves, which indicates that there are correlations among the variables across data sets. When tuber shape was regressed on the gene expression, LC-MS, GC-MS and proteomics data sets separately, only the gene expression data were found to explain significant variation. For DSC onset, we found 12 significant gene expression variables, 5 metabolite levels (GC) and 2 proteins that are associated with the trait. Using those 19 significant variables, the variation explained was 45 %. Expression QTL (eQTL) analyses showed many associations with genomic regions on chromosome 2, which also had the highest explained variation compared to other chromosomes. Transcriptomics and metabolomics analysis of enzymatic discoloration after 5 min resulted in 420 significant genes and 8 significant LC metabolites, among which two were putatively identified as caffeoylquinic acid methyl ester and tyrosine.
Conclusions: In this study, we developed a strategy for selecting and integrating multiple omics data sets using the random forest method, and selected representative individual peaks for networks based on eQTL, mQTL or pQTL information. Network analysis was done to interpret how a particular trait is associated with gene expression, metabolite and protein data.


Pages: 11
