RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor

Cited by: 5
Authors
Pallotta, Simone [1]
Cascianelli, Silvia [1]
Masseroli, Marco [1]
Affiliations
[1] Dipartimento Elettron & Informaz & Bioingn, Via Ponzio 34-5, I-20133 Milan, Italy
Funding
European Research Council;
Keywords
Heterogeneous omics big data; Data scalability; Distribution transparency; Tertiary data analysis; GENOMICS; TOOLKIT; BINDING; HADOOP; VHL;
DOI
10.1186/s12859-022-04648-4
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Background: Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.
Results: We propose RGMQL, an R/Bioconductor package conceived to provide a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different and differently localized sources. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by offering a procedural approach to dealing with omics data within the R/Bioconductor environment. Above all, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions.
Conclusions: RGMQL is able to combine the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, being a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that are particularly explanatory of its flexibility of use and interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a way that is completely transparent to the user.
Pages: 28