Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results

Cited by: 54
Authors
An, Chansik [1 ,2 ]
Park, Yae Won [3 ,4 ,5 ]
Ahn, Sung Soo [3 ,4 ,5 ]
Han, Kyunghwa [3 ,4 ,5 ]
Kim, Hwiyoung [3 ,4 ,5 ]
Lee, Seung-Koo [3 ,4 ,5 ]
Affiliations
[1] Natl Hlth Insurance Serv Ilsan Hosp, Dept Radiol, Goyang, South Korea
[2] Natl Hlth Insurance Serv Ilsan Hosp, Res Inst, Goyang, South Korea
[3] Yonsei Univ, Coll Med, Dept Radiol, Seoul, South Korea
[4] Yonsei Univ, Coll Med, Res Inst Radiol Sci, Seoul, South Korea
[5] Yonsei Univ, Coll Med, Ctr Clin Imaging Data Sci, Seoul, South Korea
Funding
National Research Foundation of Singapore;
Keywords
OPERATING CHARACTERISTIC CURVES; EXTERNAL VALIDATION; PREDICTION MODELS; PERFORMANCE; AREAS;
DOI
10.1371/journal.pone.0256152
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject Classification Codes
07; 0710; 09;
Abstract
This study aims to determine how randomly splitting a dataset into training and test sets affects the estimated performance of a machine learning model and its gap from the test performance under different conditions, using real-world brain tumor radiomics data. We conducted two classification tasks of different difficulty levels with magnetic resonance imaging (MRI) radiomics features: (1) a "simple" task, glioblastoma [n = 109] vs. brain metastasis [n = 58], and (2) a "difficult" task, low-grade [n = 163] vs. high-grade [n = 95] meningiomas. Additionally, two undersampled datasets were created by randomly sampling 50% of each of these datasets. We repeated random training-test set splitting for each dataset to create 1,000 different training-test set pairs. For each pair, a least absolute shrinkage and selection operator (LASSO) model was trained and evaluated with various validation methods in the training set, then tested in the test set, using the area under the receiver operating characteristic curve (AUC) as the evaluation metric. The AUCs in training and testing varied widely across training-test set pairs, especially with the undersampled datasets and the difficult task. The mean (± standard deviation) AUC difference between training and testing was 0.039 (± 0.032) for the simple task without undersampling and 0.092 (± 0.071) for the difficult task with undersampling. In one training-test set pair for the difficult task without undersampling, for example, the AUC was high in training but much lower in testing (0.882 and 0.667, respectively); in another pair for the same task, the AUC was low in training but much higher in testing (0.709 and 0.911, respectively). When the AUC discrepancy between training and testing, or the generalization gap, was large, none of the validation methods sufficiently reduced it. Our results suggest that machine learning after a single random training-test set split may lead to unreliable results in radiomics studies, especially those with small sample sizes.
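
The procedure described in the abstract (repeated random training-test splits, a LASSO-type classifier, and the resulting training-test AUC gap) can be illustrated with a short sketch. This is a minimal illustration assuming scikit-learn and synthetic stand-in data; it is not the authors' published code (see reference [1]), and the split ratio, penalty strength, and feature counts are assumptions, not values taken from the paper.

```python
# Minimal sketch of repeated random training-test splitting with an L1-penalized
# ("LASSO-type") classifier, recording the training-test AUC gap for each split.
# Assumptions: scikit-learn, synthetic data, 70/30 split, default penalty strength.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a small radiomics dataset (e.g., ~167 cases, ~100 features).
X, y = make_classification(n_samples=167, n_features=100, n_informative=10, random_state=0)

gaps = []
for seed in range(1000):  # 1,000 different training-test set pairs, as in the study
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=1.0),  # L1 shrinkage
    )
    model.fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    auc_te = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    gaps.append(auc_tr - auc_te)  # generalization gap for this particular split

print(f"mean AUC gap = {np.mean(gaps):.3f} +/- {np.std(gaps):.3f}")
```

Running such a loop shows how strongly the apparent performance and the generalization gap depend on which single split happens to be drawn, which is the point the study makes with the real radiomics datasets.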
Pages: 13
References
32 in total
[1] An C, 2021, GITHUB PAGE NOT SPLI
[2] Austin PC, Steyerberg EW. Events per variable (EPV) and the relative performance of different strategies for estimating the out-of-sample validity of logistic regression models. Statistical Methods in Medical Research. 2017;26(2):796-808.
[3] Bae S, An C, Ahn SS, Kim H, Han K, Kim SW, Park JE, Kim HS, Lee SK. Robust performance of deep learning for distinguishing glioblastoma from single brain metastasis using radiomic features: model development and validation. Scientific Reports. 2020;10(1).
[4] Banzato T, Causin F, Della Puppa A, Cester G, Mazzai L, Zotti A. Accuracy of deep learning to differentiate the histopathological grading of meningiomas on MR images: A preliminary study. Journal of Magnetic Resonance Imaging. 2019;50(4):1152-1159.
[5] Cawley GC. Journal of Machine Learning Research. 2010;11:2079.
[6] Chen C, Guo X, Wang J, Guo W, Ma X, Xu J. The diagnostic value of radiomics-based machine learning in predicting the grade of meningiomas using conventional magnetic resonance imaging: A preliminary study. Frontiers in Oncology. 2019;9.
[7] Chen C, Ou X, Wang J, Guo W, Ma X. Radiomics-based machine learning in differentiation between glioblastoma and metastatic brain tumors. Frontiers in Oncology. 2019;9.
[8] DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837-845.
[9] Halevy A, Norvig P, Pereira F. The unreasonable effectiveness of data. IEEE Intelligent Systems. 2009;24(2):8-12.
[10] Hanley JA, Hajian-Tilaki KO. Sampling variability of nonparametric estimates of the areas under receiver operating characteristic curves: an update. Academic Radiology. 1997;4(1):49-58.