Computational curation and analysis of publicly available protein sequence data from a single protein family

Cited by: 2
Authors
Dougherty, Kyra [1]
Hudak, Katalin A. [1]
Affiliations
[1] York Univ, Dept Biol, Toronto, ON, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
RNA-glycosylase; Ribosome inactivating protein; Gene tree; Phylogenetic inference; Bioinformatic analysis; Protein domain; Sequence conservation; Data mining;
DOI
10.1016/j.mex.2022.101846
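Chinese Library Classification number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Subject classification code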
07 ; 0710 ; 09 ;
Abstract
The wealth of sequence data available in public databases is increasing at an exponential rate. While tremendous efforts are being made to ease access to these resources, the data can be challenging for researchers to reuse because submissions come from numerous laboratories with different biological objectives, resulting in inconsistent naming conventions and sequence content. Researchers can manually inspect each sequence and curate a dataset by hand, but automating some of these steps reduces this burden. This paper is a step-by-step guide describing how to identify all proteins containing a specific domain with the Conserved Protein Domain Architecture Retrieval Tool, download all associated amino acid sequences from NCBI Entrez, and tabulate and clean the data. I will also describe how to extract the full taxonomic information and computationally predict some physicochemical properties of the proteins based on amino acid sequence. The resulting data are applicable to a wide range of bioinformatic analyses where publicly available data are utilized.
• Step-by-step guide to gathering, cleaning, and parsing data from publicly available databases for computational analysis, plus supplementation of taxonomic data and physicochemical characteristics from sequence data.
• This strategy allows for reuse of existing large-scale publicly available data for different downstream applications to answer novel biological questions.
© 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
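The tabulate, clean, and predict steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' workflow: it uses only the Python standard library, does not query CDART or NCBI Entrez, and assumes the sequences are already downloaded as FASTA text with NCBI-style headers (`accession description [Organism]`). The example records, the accessions, the helper names, and the choice of the Kyte-Doolittle GRAVY score as the physicochemical property are all assumptions made for illustration.

```python
import re

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Hypothetical records for illustration only; accessions and sequences are made up.
EXAMPLE_FASTA = """\
>TEST0001.1 ribosome-inactivating protein [Phytolacca americana]
MKSAV
LLAG
>TEST0002.1 RIP, partial [Phytolacca americana]
MKSAVLLAG
>TEST0003.1 hypothetical protein [Zea mays]
MKXAVLLAG
>TEST0004.1 antiviral protein
mkwtllag
"""

# Accession, free-text description, and an optional trailing "[Organism]".
HEADER = re.compile(r"^(?P<acc>\S+)\s*(?P<desc>.*?)\s*(?:\[(?P<org>[^\[\]]+)\])?$")


def parse_fasta(text):
    """Return (header, sequence) pairs, joining wrapped sequence lines."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def tabulate(records):
    """Split each header into named columns and normalise sequence case."""
    rows = []
    for header, seq in records:
        match = HEADER.match(header)
        if match is None:  # header without an accession token
            continue
        rows.append({
            "accession": match.group("acc"),
            "description": match.group("desc"),
            "organism": match.group("org") or "unknown",
            "sequence": seq.upper(),
        })
    return rows


def clean(rows):
    """Drop empty or ambiguous sequences and same-organism duplicates."""
    seen, kept = set(), []
    for row in rows:
        seq = row["sequence"]
        if not seq or set(seq) - set(KYTE_DOOLITTLE):
            continue
        key = (row["organism"], seq)
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


def gravy(seq):
    """Mean Kyte-Doolittle hydropathy of a sequence of standard residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def build_table(text):
    """Parse, tabulate, clean, then add length and GRAVY columns."""
    rows = clean(tabulate(parse_fasta(text)))
    for row in rows:
        row["length"] = len(row["sequence"])
        row["gravy"] = gravy(row["sequence"])
    return rows


for row in build_table(EXAMPLE_FASTA):
    print(row["accession"], row["organism"], row["length"],
          f"{row['gravy']:.3f}", sep="\t")
```

In this sketch the duplicate filter keys on organism plus sequence, so identical sequences from different species are retained; a real curation step might apply a different rule depending on the downstream analysis.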
Pages: 13
Related papers
9 items in total
  • [1] Chamberlain Scott A, 2013, F1000Res, V2, P191, DOI 10.12688/f1000research.2-191.v1
  • [2] Charif D, 2007, STRUCTURAL APPROACHE, P207, DOI 10.1007/978-3-540-35306-5_10
  • [3] Phylogeny and domain architecture of plant ribosome inactivating proteins
    Dougherty, Kyra
    Hudak, Katalin A.
    [J]. PHYTOCHEMISTRY, 2022, 202
  • [4] CDART: Protein homology by domain architecture
    Geer, LY
    Domrachev, M
    Lipman, DJ
    Bryant, SH
    [J]. GENOME RESEARCH, 2002, 12 (10) : 1619 - 1623
  • [5] Osorio D, 2015, R J, V7, P4
  • [6] Pages H., 2022, BIOSTRINGS EFFICIENT
  • [7] Database resources of the national center for biotechnology information
    Sayers, Eric W.
    Bolton, Evan E.
    Brister, J. Rodney
    Canese, Kathi
    Chan, Jessica
    Comeau, Donald C.
    Connor, Ryan
    Funk, Kathryn
    Kelly, Chris
    Kim, Sunghwan
    Madej, Tom
    Marchler-Bauer, Aron
    Lanczycki, Christopher
    Lathrop, Stacy
    Lu, Zhiyong
    Thibaud-Nissen, Francoise
    Murphy, Terence
    Phan, Lon
    Skripchenko, Yuri
    Tse, Tony
    Wang, Jiyao
    Williams, Rebecca
    Trawick, Barton W.
    Pruitt, Kim D.
    Sherry, Stephen T.
    [J]. NUCLEIC ACIDS RESEARCH, 2022, 50 (D1) : D20 - D26
  • [8] Wickham H., 2019, J OPEN SOURCE SOFTW, V4, DOI 10.21105/joss.01686
  • [9] Yang Mingzhang, 2020, Curr Protoc Bioinformatics, V69, pe90, DOI 10.1002/cpbi.90