Comparison of machine learning techniques to handle imbalanced COVID-19 CBC datasets

Cited by: 12

Authors:

Dorn, Marcio [1,2,3]

Grisci, Bruno Iochins [1]

Narloch, Pedro Henrique [1]

Feltes, Bruno Cesar [1,4]

Avila, Eduardo [3,5]

Kahmann, Alessandro [6]

Alho, Clarice Sampaio [3,5]

Affiliations:

[1] Univ Fed Rio Grande do Sul, Inst Informat, Porto Alegre, RS, Brazil

[2] Univ Fed Rio Grande do Sul, Ctr Biotechnol, Porto Alegre, RS, Brazil

[3] Natl Inst Sci & Technol, Forens Sci, Porto Alegre, RS, Brazil

[4] Univ Fed Rio Grande do Sul, Dept Genet, Porto Alegre, RS, Brazil

[5] Pontificia Univ Catolica Rio Grande do Sul, Sch Hlth & Life Sci, Porto Alegre, RS, Brazil

[6] Fed Univ Rio Grande, Inst Math Stat & Phys, Rio Grande, RS, Brazil

Source:

PEERJ COMPUTER SCIENCE | 2021 / Volume 7

Keywords:

Machine learning; Data mining; Imbalanced datasets; Covid; Hemogram; CORONAVIRUS DISEASE 2019; CLASSIFICATION; WAVE; TREES; RISK; CT;

DOI:

10.7717/peerj-cs.670

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The coronavirus pandemic caused by the novel SARS-CoV-2 has significantly impacted human health and the economy, especially in countries with limited financial resources for medical testing and treatment, such as Brazil, the third most affected country. In this scenario, machine learning techniques have been heavily employed to analyze different types of medical data and aid decision making, offering a low-cost alternative. Given the urgency of fighting the pandemic, a large number of studies apply machine learning approaches to clinical data, including complete blood count (CBC) tests, which are among the most widely available medical tests. In this work, we review the machine learning classifiers most often employed for CBC data, together with popular sampling methods for dealing with class imbalance. Additionally, we describe and critically analyze three publicly available Brazilian COVID-19 CBC datasets and evaluate the performance of eight classifiers and five sampling techniques on the selected datasets. Our work provides a panorama of which classifiers and sampling methods give the best results for different relevant metrics and discusses their impact on future analyses. The metrics and algorithms are introduced in a way that aids newcomers to the field. Finally, the panorama discussed here can significantly benefit the comparison of results from new ML algorithms.


Pages: 1 / 34

Page count: 34
