SNPfiltR: An R package for interactive and reproducible SNP filtering

Cited by: 45
Author
DeRaad, Devon [1, 2]
Affiliations
[1] Univ Kansas, Dept Ecol, Lawrence, KS 66045 USA
[2] Univ Kansas, Evolutionary Biol & Biodivers Inst, Lawrence, KS 66045 USA
Keywords
bioinformatics; phyloinformatics; genomics; missing data; R package; reproducibility; SNP filtering; software package
DOI
10.1111/1755-0998.13618
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Here, I describe the R package SNPfiltR and demonstrate its functionality as the backbone of a customizable, reproducible single nucleotide polymorphism (SNP) filtering pipeline implemented exclusively via the widely adopted R programming language. SNPfiltR extends existing SNP filtering functionalities by automating the visualization of key parameters such as sequencing depth, quality, and missing data proportion, allowing users to visually optimize and implement filtering thresholds within a single, cohesive work session. All SNPfiltR functions require vcfR objects as input, which can be easily generated by reading a SNP data set stored in standard variant call format (VCF) into an R working environment using the function read.vcfR() from the R package vcfR. Performance and accuracy benchmarking reveal that for moderately sized SNP data sets (up to 50 M genotypes, plus associated quality information), SNPfiltR performs filtering with accuracy and efficiency comparable to current state-of-the-art command-line programs. These results indicate that for most reduced-representation genomic data sets, SNPfiltR is an ideal choice for investigating, visualizing, and filtering SNPs as part of a user-friendly bioinformatic pipeline. The SNPfiltR package can be downloaded from CRAN with the command install.packages("SNPfiltR"), and the current development version is available from GitHub at: (). Thorough documentation for SNPfiltR, including multiple comprehensive vignettes detailing realistic use-cases, is available at the website: .
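The workflow summarized in the abstract (read a VCF into R as a vcfR object, then filter it) can be sketched as follows. This is a minimal illustration, not the author's benchmark code. It assumes that the vcfR and SNPfiltR packages are installed, that SNPfiltR ships an example object named vcfR.example, and that hard_filter() and missing_by_snp() accept the arguments shown. The thresholds (depth = 5, cutoff = 0.6) are arbitrary placeholders, and the temporary file stands in for a user's own VCF.

```r
# Sketch only: assumes SNPfiltR and vcfR are installed, e.g. via
# install.packages(c("SNPfiltR", "vcfR")); package names are case-sensitive.
library(vcfR)
library(SNPfiltR)

# Write the example data set assumed to ship with SNPfiltR to a temporary
# gzipped VCF, standing in for a user's own file (the path is hypothetical).
vcf_path <- tempfile(fileext = ".vcf.gz")
write.vcf(SNPfiltR::vcfR.example, file = vcf_path)

# Read the VCF into the R session as a vcfR object.
vcf_raw <- read.vcfR(vcf_path, verbose = FALSE)

# Set genotypes below an illustrative depth threshold to missing.
# hard_filter() also takes a gq argument for a genotype-quality threshold.
vcf_hard <- hard_filter(vcfR = vcf_raw, depth = 5)

# Drop SNPs whose proportion of missing genotypes exceeds an illustrative cutoff.
vcf_filtered <- missing_by_snp(vcfR = vcf_hard, cutoff = 0.6)

# Compare the number of SNPs before and after filtering.
c(before = nrow(vcf_raw@fix), after = nrow(vcf_filtered@fix))
```

Calling missing_by_snp() without a cutoff is intended to produce only the diagnostic plots, which is how a threshold would normally be chosen before it is applied.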
Pages: 2443-2453
Page count: 11