Automated Retrieval of Heterogeneous Proteomic Data for Machine Learning

Times cited: 0
Authors
Rafay, Abdul [1,2]
Aziz, Muzzamil [2]
Zia, Amjad [1]
Asif, Abdul R. R. [1,3]
Affiliations
[1] Univ Med Ctr, Dept Clin Chem, Interdisciplinary UMG Labs, D-37075 Gottingen, Germany
[2] Gesell Wissensch Datenverarbeitung mbH Gottingen G, Future Networks, eSci Grp, D-37077 Gottingen, Germany
[3] German Ctr Cardiovasc Res DZHK, Partner Site Gottingen, D-37075 Gottingen, Germany
Keywords
mass spectrometry; proteomics; machine learning; data scraping; data harvesting
DOI
10.3390/jpm13050790
Chinese Library Classification (CLC) number
R19 [Health organizations and services (health services administration)]
Abstract
Proteomics instrumentation and the corresponding bioinformatics tools have evolved rapidly over the last 20 years, whereas the use of deep learning techniques in proteomics is only now emerging. In particular, the ability to revisit raw proteomics data previously acquired on different instruments under various lab conditions could be a valuable resource for machine learning applications seeking new insight into protein expression and function. We map publicly available proteomics repositories (such as ProteomeXchange) and relevant publications, extracting MS/MS data to form one large database that contains the patient history and the mass spectrometric data acquired for the patient sample. The resulting mapped dataset should help researchers overcome the problems caused by the dispersion of proteomics data across the internet, which makes it difficult to apply emerging bioinformatics tools and deep learning algorithms. The workflow proposed in this study yields a large, linked dataset of heart-related proteomics data that could be supplied easily and efficiently to machine learning and deep learning algorithms for predicting and modeling heart diseases. Data scraping and crawling offer a powerful means to harvest and prepare training and test datasets; however, the authors advocate caution because of ethical and legal issues, as well as the need to ensure the quality and accuracy of the data being collected.
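The linking step described in the abstract can be illustrated with a minimal offline sketch. It is not the authors' implementation: the dictionary keys (`accession`, `title`, `raw_files`, `doi`, `accessions`), the keyword list, and the `LinkedRecord` type are all hypothetical, chosen only to show how repository entries might be filtered for heart-related studies and joined to publications on a shared accession.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical keyword list; a real workflow would likely use curated terms.
HEART_TERMS = ("heart", "cardiac", "cardio", "myocard")


@dataclass
class LinkedRecord:
    """One dataset entry joined to the publication that cites it, if any."""
    accession: str
    title: str
    raw_files: list[str]
    publication_doi: str | None = None


def is_heart_related(text: str) -> bool:
    """Case-insensitive substring match against the keyword list."""
    lowered = text.lower()
    return any(term in lowered for term in HEART_TERMS)


def link_records(datasets: list[dict], publications: list[dict]) -> list[LinkedRecord]:
    """Keep heart-related datasets and attach a DOI via the shared accession."""
    # Build an accession -> DOI lookup from the publication metadata.
    doi_by_accession: dict[str, str] = {}
    for pub in publications:
        for accession in pub["accessions"]:
            doi_by_accession[accession] = pub["doi"]

    linked = []
    for ds in datasets:
        if not is_heart_related(ds["title"]):
            continue  # skip studies outside the heart-related scope
        linked.append(
            LinkedRecord(
                accession=ds["accession"],
                title=ds["title"],
                raw_files=list(ds["raw_files"]),
                publication_doi=doi_by_accession.get(ds["accession"]),
            )
        )
    return linked
```

In this sketch a dataset without a matching publication is kept with `publication_doi` set to `None`, so that missing links remain visible for later quality checks rather than being silently dropped.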
Pages: 10