Effect of Various Data Preprocessing in Sequence Embedding-Based Machine Learning for Human-Virus PPI Classification

Cited by: 0
Authors
Indriani, Fatma [1]
Mahmudah, Kunti Rabiatul [1]
Satou, Kenji [2]
Affiliations
[1] Kanazawa Univ, Grad Sch Nat Sci & Technol, Kanazawa, Ishikawa, Japan
[2] Kanazawa Univ, Inst Sci & Engn, Kanazawa, Ishikawa, Japan
Source
2021 4TH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATICS ENGINEERING (IC2IE 2021) | 2021
Keywords
classification; human-virus PPI; sequence embedding; data preprocessing; oversampling; SMOTE
DOI
10.1109/IC2IE53219.2021.9649426
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Identifying human-virus protein-protein interactions (PPIs) is an important task that is increasingly studied with computational methods. Previous research shows that a doc2vec encoding scheme for features, combined with a Random Forest classifier, gives promising performance. However, human-virus PPI data are usually imbalanced, and additional preprocessing steps have not been investigated for this task. In this work, we investigate various preprocessing methods and modifications to improve classification performance. The results show that a modification to the feature formulation method, combined with random oversampling, improves the classification AUC from 0.9414 to 0.9448.
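The random oversampling step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which the record does not describe. It only assumes the usual definition of the technique: minority-class rows are duplicated at random until every class is as large as the largest one. The function name `random_oversample`, the toy feature vectors, and the seed are hypothetical.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    # Start from a copy of the original data so no row is lost.
    out_x = list(features)
    out_y = list(labels)

    for cls, n in counts.items():
        # Indices of the original rows belonging to this class.
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Draw with replacement until the class reaches the target size.
        for _ in range(target - n):
            j = rng.choice(idx)
            out_x.append(features[j])
            out_y.append(cls)
    return out_x, out_y


# Toy stand-in for embedded protein pairs: four non-interacting (0)
# examples and one interacting (1) example. Values are made up.
X = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.9], [0.4, 0.4], [0.9, 0.8]]
y = [0, 0, 0, 0, 1]

X_res, y_res = random_oversample(X, y)
```

In practice, oversampling is normally applied to the training split only, so that duplicated rows cannot leak into the data used to compute AUC.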
Pages: 74-78
Number of pages: 5