The correctness of large scale analysis of genomic data

Cited by: 1
Authors
Wojciechowski, Pawel [1, 2]
Krause, Karol [1]
Lukasiak, Piotr [1, 2]
Blazewicz, Jacek [1, 2]
Affiliations
[1] Poznan Univ Tech, Inst Comp Sci, Poznan, Poland
[2] Polish Acad Sci, Inst Bioorgan Chem, Lab Genom, Warsaw, Poland
Keywords
genomic data; large scale analysis; processing pipeline
DOI
10.2478/fcds-2021-0024
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Implementing a large genomic project is a demanding task, not least from a computer science point of view. Besides collecting and sequencing many genome samples, a huge amount of data must be processed at every stage of its production and analysis. Efficient transfer and storage of the data are also important issues. During such a project, work standards must be maintained and the quality of the results controlled, which can be difficult if part of the work is carried out externally. Here, we describe our experience with such data quality analysis on several levels: from the obvious check of the quality of the results obtained, through examining the consistency of the data at various stages of processing, to verifying, as far as possible, their compatibility with the data describing the sample.
Pages: 423-436
Page count: 14