Quartet Based Gene Tree Imputation Using Deep Learning Improves Phylogenomic Analyses Despite Missing Data

Times cited: 4
Authors
Mahbub, Sazan [1, 2]
Sawmya, Shashata [1 ]
Saha, Arpita [1 ]
Reaz, Rezwana [1 ]
Rahman, M. Sohel [1 ]
Bayzid, Md. Shamsuzzoha [1, 3]
Affiliations
[1] Bangladesh Univ Engn & Technol, Dept Comp Sci & Engn, Dhaka, Bangladesh
[2] Univ Maryland, Dept Comp Sci, College Pk, MD USA
[3] Bangladesh Univ Engn & Technol, Dept Comp Sci & Engn, ECE Bldg, Dhaka 1205, Bangladesh
Keywords
gene tree; gene tree discordance; incomplete lineage sorting; quartet consistency; quartet distribution; species tree; missing data; gene tree imputation; SPECIES TREES; MAXIMUM-LIKELIHOOD; COALESCENT; INFERENCE; CONCATENATION; PROBABILITY; CONCORDANCE; ROOT;
DOI
10.1089/cmb.2022.0212
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, for a combination of reasons (ranging from sampling biases to more biological causes, as in gene birth and loss), gene trees are often incomplete, meaning that not all species of interest have a common set of genes. Incomplete gene trees can potentially impact the accuracy of phylogenomic inference. We, for the first time, introduce the problem of imputing the quartet distribution induced by a set of incomplete gene trees, which involves adding the missing quartets back to the quartet distribution. We present Quartet based Gene tree Imputation using Deep Learning (QT-GILD), an automated and specially tailored unsupervised deep learning technique, accompanied by cues from natural language processing, which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly. QT-GILD is a general-purpose technique needing no explicit modeling of the subject system or reasons for missing data or gene tree heterogeneity. Experimental studies on a collection of simulated and empirical datasets suggest that QT-GILD can effectively impute the quartet distribution, which results in a dramatic improvement in the species tree accuracy. Remarkably, QT-GILD not only imputes the missing quartets but can also account for gene tree estimation error. Therefore, QT-GILD advances the state-of-the-art in species tree estimation from gene trees in the face of missing data.
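To make the terms in the abstract concrete, the following is a minimal sketch of what a quartet distribution induced by incomplete gene trees looks like. It is not QT-GILD and performs no imputation; it only counts the quartet topologies present in a toy set of gene trees and lists the four-taxon sets that no gene tree covers. The nested-tuple tree encoding, the function names, and the example trees are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def clusters(tree):
    """Return (leaf set, leaf sets under every internal node) of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clusters(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def induced_quartets(tree):
    """Set of quartet topologies ab|cd induced by one (possibly incomplete) tree."""
    leaves, found = clusters(tree)
    quartets = set()
    for four in combinations(sorted(leaves), 4):
        for cluster in found:
            inside = [t for t in four if t in cluster]
            if len(inside) == 2:
                # A cluster holding exactly two of the four taxa separates them
                # from the other two, which fixes the quartet topology.
                outside = [t for t in four if t not in cluster]
                quartets.add(frozenset([frozenset(inside), frozenset(outside)]))
                break
    return quartets


def quartet_distribution(gene_trees):
    """Count how many gene trees induce each quartet topology."""
    counts = Counter()
    for tree in gene_trees:
        counts.update(induced_quartets(tree))
    return counts


def uncovered_taxon_sets(gene_trees, taxa):
    """Four-taxon sets for which no gene tree induces any quartet."""
    covered = {frozenset().union(*q) for q in quartet_distribution(gene_trees)}
    return [four for four in combinations(sorted(taxa), 4)
            if frozenset(four) not in covered]


# Hypothetical toy data: the first gene tree lacks taxon E, the second lacks D.
gene_trees = [
    ((("A", "B"), "C"), "D"),
    (("A", "B"), ("C", "E")),
]
dist = quartet_distribution(gene_trees)
missing = uncovered_taxon_sets(gene_trees, "ABCDE")
```

In this toy input, every four-taxon set containing both D and E is uncovered, because no gene tree samples both taxa. Those are the kinds of gaps in the quartet distribution that the abstract describes imputing; note that the paper may define missing quartets per gene tree rather than across the whole set, as this simplified sketch does.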
Pages: 1156 / 1172
Number of pages: 17