Quartet Based Gene Tree Imputation Using Deep Learning Improves Phylogenomic Analyses Despite Missing Data

Cited by: 4
Authors
Mahbub, Sazan [1, 2]
Sawmya, Shashata [1]
Saha, Arpita [1]
Reaz, Rezwana [1]
Rahman, M. Sohel [1]
Bayzid, Md. Shamsuzzoha [1, 3]
Affiliations
[1] Bangladesh Univ Engn & Technol, Dept Comp Sci & Engn, Dhaka, Bangladesh
[2] Univ Maryland, Dept Comp Sci, College Pk, MD USA
[3] Bangladesh Univ Engn & Technol, Dept Comp Sci & Engn, ECE Bldg, Dhaka 1205, Bangladesh
Keywords
gene tree; gene tree discordance; incomplete lineage sorting; quartet consistency; quartet distribution; species tree; missing data; gene tree imputation; SPECIES TREES; MAXIMUM-LIKELIHOOD; COALESCENT; INFERENCE; CONCATENATION; PROBABILITY; CONCORDANCE; ROOT
DOI
10.1089/cmb.2022.0212
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, for a combination of reasons (ranging from sampling biases to more biological causes, as in gene birth and loss), gene trees are often incomplete, meaning that not all species of interest have a common set of genes. Incomplete gene trees can potentially impact the accuracy of phylogenomic inference. We, for the first time, introduce the problem of imputing the quartet distribution induced by a set of incomplete gene trees, which involves adding the missing quartets back to the quartet distribution. We present Quartet based Gene tree Imputation using Deep Learning (QT-GILD), an automated and specially tailored unsupervised deep learning technique, accompanied by cues from natural language processing, which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly. QT-GILD is a general-purpose technique needing no explicit modeling of the subject system or reasons for missing data or gene tree heterogeneity. Experimental studies on a collection of simulated and empirical datasets suggest that QT-GILD can effectively impute the quartet distribution, which results in a dramatic improvement in the species tree accuracy. Remarkably, QT-GILD not only imputes the missing quartets but can also account for gene tree estimation error. Therefore, QT-GILD advances the state-of-the-art in species tree estimation from gene trees in the face of missing data.
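To make the term "quartet distribution" concrete, here is a minimal Python sketch. It is not QT-GILD and performs no imputation; it only counts the quartet topologies induced by a few toy gene trees and lists the four-taxon sets that no gene tree covers, which are the entries an imputation method would have to fill in. The nested-tuple tree encoding, the function names, and the toy trees are all assumptions made for illustration, and the trees are assumed to be fully resolved (binary).

```python
from collections import Counter
from itertools import combinations


def clades(tree):
    """Return (leaf set of `tree`, leaf sets of every subtree in `tree`)."""
    if isinstance(tree, str):  # a leaf is just a taxon name
        leaf = frozenset([tree])
        return leaf, [leaf]
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def quartets(tree):
    """Map each four-taxon set present in `tree` to its induced topology.

    A topology ab|cd is stored as frozenset({frozenset(ab), frozenset(cd)}).
    It is induced when some edge (here: some clade) separates ab from cd.
    """
    leaves, found = clades(tree)
    result = {}
    for four in combinations(sorted(leaves), 4):
        q = frozenset(four)
        for clade in found:
            side = clade & q
            if len(side) == 2:
                result[q] = frozenset([side, q - side])
                break
    return result


def quartet_distribution(gene_trees):
    """Count how often each quartet topology occurs across the gene trees."""
    counts = Counter()
    for tree in gene_trees:
        counts.update(quartets(tree).values())
    return counts


def missing_four_sets(gene_trees, taxa):
    """Four-taxon sets over `taxa` that no gene tree resolves."""
    covered = set()
    for tree in gene_trees:
        covered.update(quartets(tree).keys())
    every = {frozenset(c) for c in combinations(sorted(taxa), 4)}
    return every - covered


def topo(left, right):
    """Helper for single-character taxon names, e.g. topo("AB", "CD")."""
    return frozenset([frozenset(left), frozenset(right)])


# Hypothetical toy data: five taxa, every gene tree lacks one taxon.
taxa = ["A", "B", "C", "D", "E"]
gene_trees = [
    (("A", "B"), ("C", "D")),  # missing E
    (("A", "B"), ("C", "E")),  # missing D
    (("A", "C"), ("B", "D")),  # missing E, conflicts with the first tree
]

dist = quartet_distribution(gene_trees)
missing = missing_four_sets(gene_trees, taxa)
print(len(dist), "observed topologies;", len(missing), "four-taxon sets uncovered")
```

In this toy input, only two of the five possible four-taxon sets are covered by any gene tree, so a summary method working from these trees alone would see no signal for the remaining three.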
Pages: 1156-1172
Number of pages: 17