DeepCGP: A Deep Learning Method to Compress Genome-Wide Polymorphisms for Predicting Phenotype of Rice

Cited by: 5
Authors
Islam, Tanzila [1]
Kim, Chyon Hae [1]
Iwata, Hiroyoshi [2]
Hiroyuki, Shimono [3, 4]
Kimura, Akio [1, 4]
Affiliations
[1] Iwate Univ, Grad Sch Sci & Engn, Dept Syst Innovat Engn, Morioka, Iwate 0208550, Japan
[2] Univ Tokyo, Dept Agr & Environm Biol, Bunkyo Ku, Tokyo 1130033, Japan
[3] Iwate Univ, Fac Agr, Crop Sci Lab, Morioka, Iwate 0208550, Japan
[4] Iwate Univ, Agri Innovat Ctr, Morioka, Iwate 0208550, Japan
Funding
Japan Society for the Promotion of Science (JSPS)
Keywords
Bioinformatics; Genomics; Data models; Predictive models; Mathematical models; Deep learning; Radio frequency; autoencoder; genomic selection; data compression; genomic prediction; BREEDING TECHNOLOGIES; FOOD SECURITY; REGRESSION; SELECTION;
DOI
10.1109/TCBB.2022.3231466
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Genomic selection (GS) is expected to accelerate plant and animal breeding. Over the last decade, the volume of genome-wide polymorphism data has grown, raising concerns about storage cost and computational time. Several studies have separately attempted to compress genome data or to predict phenotypes. However, existing compression models do not preserve adequate data quality after compression, and existing prediction models are time-consuming and predict the phenotype from the original, uncompressed data. Combining compression and genomic prediction modeling through deep learning could resolve these limitations. A Deep Learning Compression-based Genomic Prediction (DeepCGP) model was proposed that compresses genome-wide polymorphism data and predicts phenotypes of a target trait from the compressed information. The DeepCGP model has two parts: (i) an autoencoder model based on deep neural networks to compress genome-wide polymorphism data, and (ii) regression models based on random forests (RF), genomic best linear unbiased prediction (GBLUP), and Bayesian variable selection (BayesB) to predict phenotypes from the compressed information. The model was applied to two rice datasets containing genome-wide marker genotypes and target trait phenotypes. DeepCGP achieved a prediction accuracy of up to 99% for a trait after 98% compression. Compared with the other two regression methods, BayesB required extensive computational time and showed the highest accuracy; however, BayesB could only be used with compressed data. Overall, DeepCGP outperformed state-of-the-art methods in both compression and prediction. The code and data are available at https://github.com/tanzilamohita/DeepCGP.
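The two-part design described in the abstract (compress first, then predict from the compressed representation) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single-hidden-layer autoencoder trained with plain gradient descent, uses ridge regression as a simple stand-in for the RF, GBLUP, and BayesB regressors, and runs on synthetic genotypes coded as -1/0/1. All function names, sizes, and hyperparameters below are hypothetical, and accuracy is taken here to mean the Pearson correlation between predicted and observed phenotypes, which is also an assumption.

```python
import numpy as np

def train_autoencoder(X, code_dim, epochs=300, lr=0.2, seed=0):
    """Train a one-hidden-layer tanh autoencoder by gradient descent (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W_enc = rng.normal(0.0, 1.0 / np.sqrt(p), size=(p, code_dim))
    W_dec = rng.normal(0.0, 0.1, size=(code_dim, p))
    losses = []
    for _ in range(epochs):
        Z = np.tanh(X @ W_enc)              # compressed representation
        err = Z @ W_dec - X                 # reconstruction error (linear decoder)
        losses.append(float(np.mean(err ** 2)))
        grad_out = 2.0 * err / err.size     # gradient of the mean squared error
        grad_W_dec = Z.T @ grad_out
        grad_pre = (grad_out @ W_dec.T) * (1.0 - Z ** 2)   # back through tanh
        grad_W_enc = X.T @ grad_pre
        W_dec -= lr * grad_W_dec
        W_enc -= lr * grad_W_enc
    return W_enc, W_dec, losses

def encode(X, W_enc):
    """Map marker genotypes to the compressed code."""
    return np.tanh(X @ W_enc)

def fit_ridge(Z, y, lam=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    Zc = np.column_stack([np.ones(len(Z)), Z])
    penalty = lam * np.eye(Zc.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(Zc.T @ Zc + penalty, Zc.T @ y)

def predict_ridge(Z, coef):
    return coef[0] + Z @ coef[1:]

# Synthetic stand-in data (assumption): 60 lines, 500 markers coded -1/0/1.
rng = np.random.default_rng(42)
n_lines, n_markers, code_dim = 60, 500, 10
X = rng.integers(-1, 2, size=(n_lines, n_markers)).astype(float)
y = X @ rng.normal(0.0, 1.0, n_markers) + rng.normal(0.0, 1.0, n_lines)

# Part (i): compress the markers. No phenotypes are used in this step.
W_enc, W_dec, losses = train_autoencoder(X, code_dim)
Z = encode(X, W_enc)
compression = 1.0 - Z.shape[1] / X.shape[1]

# Part (ii): predict phenotypes from the compressed code only.
train, test = np.arange(0, 45), np.arange(45, 60)
coef = fit_ridge(Z[train], y[train])
accuracy = np.corrcoef(predict_ridge(Z[test], coef), y[test])[0, 1]

print(f"compression: {compression:.0%}")
print(f"reconstruction MSE: {losses[0]:.4f} -> {losses[-1]:.4f}")
print(f"held-out correlation: {accuracy:.3f}")
```

Here 500 markers are reduced to 10 features, so the compression ratio matches the 98% figure quoted in the abstract only by construction. With random, unstructured synthetic markers the held-out correlation is not meaningful as a benchmark; the sketch is intended to show the data flow, not to reproduce the reported results.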
Pages: 2078-2088
Page count: 11