Computational curation and analysis of publicly available protein sequence data from a single protein family

Cited by: 2

Authors:

Dougherty, Kyra [1]

Hudak, Katalin A. [1]

Affiliations:

[1] York Univ, Dept Biol, Toronto, ON, Canada

Source:

MethodsX | 2022 / Volume 9

Funding:

Natural Sciences and Engineering Research Council of Canada;

Keywords:

RNA-glycosylase; Ribosome inactivating protein; Gene tree; Phylogenetic inference; Bioinformatic analysis; Protein domain; Sequence conservation; Data mining;

DOI:

10.1016/j.mex.2022.101846

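Chinese Library Classification:

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];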

Subject classification codes:

07 ; 0710 ; 09 ;

Abstract:

The wealth of sequence data available in public databases is increasing at an exponential rate, and while tremendous efforts are being made to make access to these resources easier, these data can be challenging for researchers to reuse because submissions are made from numerous laboratories with different biological objectives, resulting in inconsistent naming conventions and sequence content. Researchers can manually inspect each sequence and curate a dataset by hand, but automating some of these steps will reduce this burden. This paper is a step-by-step guide describing how to identify all proteins containing a specific domain with the Conserved Protein Domain Architecture Retrieval Tool, download all associated amino acid sequences from NCBI Entrez, and tabulate and clean the data. I will also describe how to extract the full taxonomic information and computationally predict some physicochemical properties of the proteins based on amino acid sequence. The resulting data are applicable to a wide range of bioinformatic analyses where publicly available data are utilized.

• Step-by-step guide to gathering, cleaning, and parsing data from publicly available databases for computational analysis, plus supplementation of taxonomic data and physicochemical characteristics from sequence data.

• This strategy allows for reuse of existing large-scale publicly available data for different downstream applications to answer novel biological questions.

© 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license ( http://creativecommons.org/licenses/by-nc-nd/4.0/ )
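The tabulate, clean, and annotate steps named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's own implementation, and several things here are assumptions: the FASTA records and their `TOY_` accessions are invented for illustration, the cleaning rules (drop duplicate sequences and sequences with non-standard residues) are one plausible choice, and the grand average of hydropathy (GRAVY) on the Kyte-Doolittle scale stands in for "some physicochemical properties", which the abstract does not enumerate. The organism is read from the trailing square brackets that NCBI protein FASTA headers typically carry.

```python
import re

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Hypothetical toy records; accessions and sequences are made up.
TOY_FASTA = """\
>TOY_0001 toy protein A [Phytolacca americana]
AILV
>TOY_0002 toy protein A, resubmitted [Phytolacca americana]
AILV
>TOY_0003 toy protein with ambiguous residue [Ricinus communis]
MKXT
>TOY_0004 toy protein B [Ricinus communis]
DE
KR
"""


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def tabulate(records):
    """Clean the records and return one dict (table row) per kept sequence."""
    rows, seen = [], set()
    for header, seq in records:
        # Assumed cleaning rules: skip empty, duplicate, or non-standard sequences.
        if not seq or seq in seen or set(seq) - set(KYTE_DOOLITTLE):
            continue
        seen.add(seq)
        accession, _, description = header.partition(" ")
        match = re.search(r"\[([^\[\]]+)\]\s*$", description)
        rows.append({
            "accession": accession,
            "organism": match.group(1) if match else None,
            "length": len(seq),
            # GRAVY: mean hydropathy over all residues.
            "gravy": sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq),
        })
    return rows


rows = tabulate(parse_fasta(TOY_FASTA))
for row in rows:
    print(row)
```

In a real workflow the FASTA text would come from an NCBI Entrez download rather than an inline string, and the organism name would be resolved to a full lineage through a taxonomy lookup; both steps need network access and are left out of this sketch.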

Pages: 13
