A new tool called DISSECT for analysing large genomic data sets using a Big Data approach

Cited by: 39
Authors
Canela-Xandri, Oriol [1, 2]
Law, Andy [1, 2]
Gray, Alan [3]
Woolliams, John A. [1, 2]
Tenesa, Albert [1, 2, 4]
Affiliations
[1] Univ Edinburgh, Roslin Inst, Edinburgh EH25 9RG, Midlothian, Scotland
[2] Univ Edinburgh, Royal Dick Sch Vet Studies, Edinburgh EH25 9RG, Midlothian, Scotland
[3] Univ Edinburgh, EPCC, Edinburgh EH9 3FD, Midlothian, Scotland
[4] Univ Edinburgh, MRC IGMM, MRC HGU, Edinburgh EH4 2XU, Midlothian, Scotland
Funding
UK Medical Research Council (MRC); UK Biotechnology and Biological Sciences Research Council (BBSRC)
Keywords
AVERAGE INFORMATION REML; MIXED-MODEL ANALYSIS; GENETIC RISK; ASSOCIATION; PREDICTION; DISEASE; TRAITS; ACCURACY;
DOI
10.1038/ncomms10162
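Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Subject classification code
07; 0710; 09
Abstract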
Large-scale genetic and genomic data are increasingly available, and the major bottleneck in their analysis is a lack of sufficiently scalable computational tools. To address this problem in the context of complex-trait analysis, we present DISSECT. DISSECT is new, freely available software that exploits the distributed-memory parallel architectures of compute clusters to perform a wide range of genomic and epidemiologic analyses that currently can be carried out only on reduced sample sizes or under restricted conditions. We demonstrate the usefulness of the tool by addressing the challenge of predicting phenotypes from genotype data in human populations using mixed-linear model analysis. We analyse simulated traits from 470,000 individuals genotyped for 590,004 SNPs in ~4 h using the combined computational power of 8,400 processor cores. We find that prediction accuracies in excess of 80% of the theoretical maximum could be achieved with large sample sizes.
Pages: 6