Impact of Dataset Size on Classification Performance: An Empirical Evaluation in the Medical Domain

Cited by: 220
Authors
Althnian, Alhanoof [1]
AlSaeed, Duaa [1]
Al-Baity, Heyam [1]
Samha, Amani [2]
Dris, Alanoud Bin [3]
Alzakari, Najla [3]
Abou Elwafa, Afnan [4]
Kurdi, Heba [4, 5]
Affiliations
[1] King Saud Univ, Dept Informat Technol, Coll Comp & Informat Sci, Riyadh 11451, Saudi Arabia
[2] King Saud Univ, Dept Management Informat Syst, Coll Business Adm, Riyadh 11451, Saudi Arabia
[3] King Abdulaziz City Sci & Technol, Natl Ctr Cyber Secur Technol, Riyadh 11442, Saudi Arabia
[4] King Saud Univ, Dept Comp Sci, Coll Comp & Informat Sci, Riyadh 11451, Saudi Arabia
[5] MIT, Dept Mech Engn, Cambridge, MA 02142 USA
Source
APPLIED SCIENCES-BASEL | 2021 / Vol. 11 / Issue 02
Keywords
medical data; dataset size; supervised models; classification; performance; machine learning;
DOI
10.3390/app11020796
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Dataset size is a major concern in the medical domain, where a lack of data is common. This study investigates the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six models widely used in the medical field, namely support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), AdaBoost (AB), and naive Bayes (NB), on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyzed the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, F-score, specificity, and area under the ROC curve (AUC). Our results indicated that the overall performance of classifiers depends on how well a dataset represents the original distribution rather than on its size. Moreover, we found that the most robust models for limited medical data are AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a machine learning model's robustness to limited data does not necessarily imply that it provides the best performance compared to other models.
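For readers unfamiliar with the evaluation measures named in the abstract, the following is a minimal sketch of how accuracy, precision, recall, F-score, and specificity are conventionally computed for a binary classifier from its confusion matrix. It is not the authors' code: the function name `binary_metrics`, the choice of label 1 as the positive class, and the convention of returning 0.0 when a denominator is zero are all assumptions made for illustration. AUC is omitted because it requires ranked scores rather than hard predictions.

```python
def binary_metrics(y_true, y_pred):
    """Compute standard binary classification metrics.

    Hypothetical helper for illustration only. Label 1 is assumed to be
    the positive class and label 0 the negative class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f_score": ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Toy labels, not data from the study.
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    preds = [1, 0, 1, 0, 0, 1, 0, 1]
    for name, value in binary_metrics(truth, preds).items():
        print(f"{name}: {value:.3f}")
```

Comparing these values for the same model trained on the full dataset and on a reduced one gives a simple view of how sensitive that model is to dataset size.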
Pages: 1 / 18
Page count: 18