Quartet Based Gene Tree Imputation Using Deep Learning Improves Phylogenomic Analyses Despite Missing Data

Cited by: 4
Authors
Mahbub, Sazan [1, 2]
Sawmya, Shashata [1 ]
Saha, Arpita [1 ]
Reaz, Rezwana [1 ]
Rahman, M. Sohel [1 ]
Bayzid, Md. Shamsuzzoha [1, 3]
Affiliations
[1] Bangladesh Univ Engn & Technol, Dept Comp Sci & Engn, Dhaka, Bangladesh
[2] Univ Maryland, Dept Comp Sci, College Pk, MD USA
[3] Bangladesh Univ Engn & Technol, Dept Comp Sci & Engn, ECE Bldg, Dhaka 1205, Bangladesh
Keywords
gene tree; gene tree discordance; incomplete lineage sorting; quartet consistency; quartet distribution; species tree; missing data; gene tree imputation; SPECIES TREES; MAXIMUM-LIKELIHOOD; COALESCENT; INFERENCE; CONCATENATION; PROBABILITY; CONCORDANCE; ROOT;
DOI
10.1089/cmb.2022.0212
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, for a combination of reasons (ranging from sampling biases to more biological causes, as in gene birth and loss), gene trees are often incomplete, meaning that not all species of interest have a common set of genes. Incomplete gene trees can potentially impact the accuracy of phylogenomic inference. We, for the first time, introduce the problem of imputing the quartet distribution induced by a set of incomplete gene trees, which involves adding the missing quartets back to the quartet distribution. We present Quartet based Gene tree Imputation using Deep Learning (QT-GILD), an automated and specially tailored unsupervised deep learning technique, accompanied by cues from natural language processing, which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly. QT-GILD is a general-purpose technique needing no explicit modeling of the subject system or reasons for missing data or gene tree heterogeneity. Experimental studies on a collection of simulated and empirical datasets suggest that QT-GILD can effectively impute the quartet distribution, which results in a dramatic improvement in the species tree accuracy. Remarkably, QT-GILD not only imputes the missing quartets but can also account for gene tree estimation error. Therefore, QT-GILD advances the state-of-the-art in species tree estimation from gene trees in the face of missing data.
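The abstract's central object, the quartet distribution induced by a set of incomplete gene trees, can be illustrated with a minimal Python sketch. This is not QT-GILD and performs no imputation; it only counts the quartet topologies that the gene trees do supply and reports how many gene trees cover each four-taxon set. The nested-tuple tree encoding, the function names (`clades`, `quartet_topology`, `quartet_distribution`, `quartet_coverage`, `split`), and the toy trees are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def clades(tree):
    """Return (leaf set of `tree`, list of leaf sets of every subtree)."""
    if isinstance(tree, str):  # a leaf is a taxon name
        leaves = frozenset([tree])
        return leaves, [leaves]
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def split(a, b, c, d):
    """Hypothetical helper: encode the quartet topology ab|cd."""
    return frozenset([frozenset([a, b]), frozenset([c, d])])


def quartet_topology(four, clade_list):
    """Return the 2|2 split a tree induces on four taxa, or None if unresolved."""
    four = frozenset(four)
    for clade in clade_list:
        side = clade & four
        if len(side) == 2:  # this clade separates two taxa from the other two
            return frozenset([side, four - side])
    return None


def quartet_distribution(trees):
    """Count each quartet topology over all gene trees (missing taxa allowed)."""
    counts = Counter()
    for tree in trees:
        leaves, clade_list = clades(tree)
        for four in combinations(sorted(leaves), 4):
            topology = quartet_topology(four, clade_list)
            if topology is not None:
                counts[topology] += 1
    return counts


def quartet_coverage(trees, taxa):
    """Map each four-taxon set to the number of gene trees containing all four."""
    leaf_sets = [clades(tree)[0] for tree in trees]
    return {
        frozenset(four): sum(1 for leaves in leaf_sets if leaves.issuperset(four))
        for four in combinations(sorted(taxa), 4)
    }


# Toy input (assumed): one complete gene tree and two that lack taxon E.
gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]
distribution = quartet_distribution(gene_trees)
coverage = quartet_coverage(gene_trees, "ABCDE")
```

In this toy input, every four-taxon set that includes E is covered by only one of the three gene trees. Those under-covered entries are the kind of gap that the abstract describes filling by adding missing quartets back to the distribution.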
Pages: 1156-1172
Number of pages: 17
Related papers
50 in total
  • [21] Are deep learning models superior for missing data imputation in surveys? Evidence from an empirical comparison
    Wang, Zhenhua
    Akande, Olanrewaju
    Poulos, Jason
    Li, Fan
    SURVEY METHODOLOGY, 2022, 48 (02) : 375 - 399
  • [22] Transformers deep learning models for missing data imputation: an application of the ReMasker model on a psychometric scale
    Casella, Monica
    Milano, Nicola
    Dolce, Pasquale
    Marocco, Davide
    FRONTIERS IN PSYCHOLOGY, 2024, 15
  • [23] Reference-based multiple imputation for missing data sensitivity analyses in trial-based cost-effectiveness analysis
    Leurent, Baptiste
    Gomes, Manuel
    Cro, Suzie
    Wiles, Nicola
    Carpenter, James R.
    HEALTH ECONOMICS, 2020, 29 (02) : 171 - 184
  • [24] Evaluating a sequential tree-based procedure for multivariate imputation of complex missing data structures
    Borgoni, Riccardo
    Berrington, Ann
    QUALITY & QUANTITY, 2013, 47 (04) : 1991 - 2008
  • [25] Evaluating a sequential tree-based procedure for multivariate imputation of complex missing data structures
    Riccardo Borgoni
    Ann Berrington
    Quality & Quantity, 2013, 47 : 1991 - 2008
  • [26] Data-driven missing data imputation in cluster monitoring system based on deep neural network
    Lin, Jie
    Li, NianHua
    Alam, Md Ashraful
    Ma, Yuqing
    APPLIED INTELLIGENCE, 2020, 50 (03) : 860 - 877
  • [27] Data-driven missing data imputation in cluster monitoring system based on deep neural network
    Jie Lin
    NianHua Li
    Md Ashraful Alam
    Yuqing Ma
    Applied Intelligence, 2020, 50 : 860 - 877
  • [28] Learning-Based Adaptive Imputation Method with kNN Algorithm for Missing Power Data
    Kim, Minkyung
    Park, Sangdon
    Lee, Joohyung
    Joo, Yongjae
    Choi, Jun Kyun
    ENERGIES, 2017, 10 (10)
  • [29] Missing data imputation using utility-based regression and sampling approaches
    Haliduola, Halimu N.
    Bretz, Frank
    Mansmann, Ulrich
    COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, 2022, 226
  • [30] Fault Diagnosis Based on Deep Learning Subject to Missing Data
    Liu, Weibo
    Wei, Dan
    Zhou, Funa
    PROCEEDINGS OF THE 30TH CHINESE CONTROL AND DECISION CONFERENCE (2018 CCDC), 2018, : 3972 - 3977