A Distributed Framework for Large-scale Protein-protein Interaction Data Analysis and Prediction Using MapReduce

Citations: 48
Authors
Hu, Lun [1 ,5 ]
Yang, Shicheng [2 ]
Luo, Xin [1 ,3 ,4 ]
Yuan, Huaqiang [1 ]
Sedraoui, Khaled [6 ,7 ]
Zhou, MengChu [8 ]
Affiliations
[1] Dongguan Univ Technol, Sch Comp Sci & Technol, Dongguan 523808, Peoples R China
[2] Wuhan Univ Technol, Sch Comp Sci & Technol, Wuhan 430070, Hubei, Peoples R China
[3] Chongqing Inst Green & Intelligent Technol, Chongqing Engn Res Ctr Big Data Applicat Smart Ci, Chongqing 400714, Peoples R China
[4] Chongqing Inst Green & Intelligent Technol, Chongqing Key Lab Big Data & Intelligent Comp, Chinese Acad Sci, Chongqing 400714, Peoples R China
[5] Xinjiang Tech Inst Phys & Chem, Chinese Acad Sci, Urumqi 830000, Peoples R China
[6] King Abdulaziz Univ, Ctr Res Excellence Renewable Energy & Power Syst, Jeddah 21589, Saudi Arabia
[7] King Abdulaziz Univ, Dept Elect & Comp Engn, Fac Engn, Jeddah 21589, Saudi Arabia
[8] New Jersey Inst Technol, Dept Elect & Comp Engn, Newark, NJ 07102 USA
Funding
National Natural Science Foundation of China;
Keywords
Distributed computing; large-scale prediction; machine learning; MapReduce; protein-protein interaction (PPI); GENE ORDER; NETWORK; ALGORITHM; INFERENCE; MODEL;
DOI
10.1109/JAS.2021.1004198
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Protein-protein interactions are of great significance for understanding the functional mechanisms of proteins. With the rapid development of high-throughput genomic technologies, massive amounts of protein-protein interaction (PPI) data have been generated, making efficient analysis very difficult. To address this problem, this paper presents a distributed framework that reimplements one of the state-of-the-art algorithms, CoFex, using MapReduce. To do so, an in-depth analysis of CoFex's limitations is conducted from the perspectives of efficiency and memory consumption when it is applied to large-scale PPI data analysis and prediction. Solutions are then devised to overcome each of these limitations. In particular, we adopt a novel tree-based data structure to reduce the heavy memory consumption caused by the huge volume of protein sequence information. After that, the procedure of CoFex is modified to follow the MapReduce framework so that the prediction task is carried out in a distributed manner. A series of extensive experiments has been conducted to evaluate the performance of our framework in terms of both efficiency and accuracy. Experimental results demonstrate that the proposed framework improves the computational efficiency of CoFex by more than two orders of magnitude while retaining the same high accuracy.
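The abstract names two ideas without giving details: a tree-based structure that cuts memory use for sequence data, and a map/reduce decomposition of the workload. The sketch below is only a generic illustration of both, not the paper's method. It assumes a plain prefix tree (trie) and a toy k-mer counting job; the names `Trie`, `map_kmers`, `shuffle`, `reduce_counts`, `run_job`, and the example sequences are all hypothetical and do not come from CoFex or from any MapReduce library.

```python
from collections import defaultdict
from itertools import chain


class TrieNode:
    """One node of a prefix tree; shared prefixes are stored only once."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0  # number of inserted strings that end at this node


class Trie:
    """Hypothetical prefix tree, used here only to illustrate prefix sharing."""

    def __init__(self):
        self.root = TrieNode()
        self.node_count = 1  # counts the root node

    def insert(self, s):
        node = self.root
        for ch in s:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = TrieNode()
                node.children[ch] = nxt
                self.node_count += 1
            node = nxt
        node.count += 1

    def count(self, s):
        """Return how many times the exact string s was inserted."""
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count


def map_kmers(protein_id, sequence, k=3):
    """Map step: emit a (k-mer, 1) pair for every length-k window."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(proteins, k=3):
    """Run map, shuffle, and reduce in a single process (no cluster)."""
    mapped = chain.from_iterable(
        map_kmers(pid, seq, k) for pid, seq in proteins.items()
    )
    return dict(reduce_counts(key, vals) for key, vals in shuffle(mapped).items())


# Made-up example sequences (assumption, not data from the paper).
proteins = {"P1": "MKTAYIAK", "P2": "MKTAYLLK"}
kmer_counts = run_job(proteins)

trie = Trie()
for seq in proteins.values():
    trie.insert(seq)
```

In this toy input the two sequences share the prefix `MKTAY`, so the trie holds fewer character nodes than the 16 characters inserted. On a real cluster the map and reduce functions would run on separate workers, with the framework handling the shuffle.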
Pages: 160 - 172
Number of pages: 13