Ground truth generalizability affects performance of the artificial intelligence model in automated vertebral fracture detection on plain lateral radiographs of the spine

Cited by: 13

Authors:

Chou, Po-Hsin [1,2]

Jou, Tony Hong-Ting [1]

Wu, Hung-Ta Hondar [3]

Yao, Yu-Cheng [1,2]

Lin, Hsi-Hsien [1,2]

Chang, Ming-Chau [1,2]

Wang, Shih-Tien [1,2]

Lu, Henry Horng-Shing [4,5]

Chen, Hung-Hsun [6]

Affiliations:

[1] Natl Yang Ming Chiao Tung Univ, Sch Med, Taipei, Taiwan

[2] Taipei Vet Gen Hosp, Dept Orthoped & Traumatol, Taipei, Taiwan

[3] Taipei Vet Gen Hosp, Dept Radiol, Taipei, Taiwan

[4] Natl Yang Ming Chiao Tung Univ, Inst Stat, Assembly Bldg 1,1001 Ta Hsueh Rd, Hsinchu 30010, Taiwan

[5] Natl Yang Ming Chiao Tung Univ, Inst Data Sci & Engn, Hsinchu, Taiwan

[6] Fu Jen Catholic Univ, Program Artificial Intelligence & Informat Secur, 510 Zhongzheng Rd, New Taipei 242062, Taiwan

Source:

SPINE JOURNAL | 2022 / Vol. 22 / Issue 04

Keywords:

Artificial intelligence; Deep learning; Generalizability; Ground truth; Limbus; Vertebral fractures; BURST FRACTURES; THORACOLUMBAR; DENSITY;

DOI:

10.1016/j.spinee.2021.10.020

Chinese Library Classification:

R74 [Neurology and Psychiatry];

Subject classification code:

Abstract:

BACKGROUND CONTEXT: Computer-aided diagnosis with artificial intelligence (AI) has been used clinically, and ground truth generalizability is important for AI performance in medical image analyses. An AI model trained on one specific group of older adults (aged ≥60) has not yet been shown to work equally well in a younger adult group (aged 18-59).

PURPOSE: To compare, between younger and older adult populations, the performance of the developed AI ensemble model, trained with the ground truth for those aged 60 years or older, in identifying vertebral fractures (VFs) on plain lateral radiographs of the spine (PLRS).

STUDY DESIGN/SETTING: Retrospective analysis of PLRS in a single medical institution.

OUTCOME MEASURES: Accuracy, sensitivity, specificity, and interobserver reliability (kappa value) were used to compare the diagnostic performance of the AI model and the subspecialists' consensus between the two groups.

METHODS: Between January 2016 and December 2018, the ground truth of 941 patients (one PLRS per person) aged 60 years and older with 1101 VFs and 6358 normal vertebrae was used to set up the AI model. The framework of the developed AI model includes: object detection with You Only Look Once Version 3 (YOLOv3) at T0-L5 levels in the PLRS, data preprocessing with image-size and quality processing, and an AI ensemble model (ResNet34, DenseNet121, and DenseNet201) for identifying or grading VFs. The reported overall accuracy, sensitivity and specificity were 92%, 91% and 93%, respectively, and external validation was also performed. Thereafter, patients diagnosed with VFs and treated in our institution from October 2019 to August 2020 formed the study group regardless of age. In total, 258 patients (339 VFs and 1725 normal vertebrae) in the older adult population (mean age 78 ± 10.4; range, 60-106) were enrolled. In the younger adult population (mean age 36 ± 9.43; range, 20-49), 106 patients (120 VFs and 728 normal vertebrae) were enrolled. After identification and grading of VFs based on the Genant method with consensus between two subspecialists, VFs in each PLRS with human labels were defined as the testing dataset. The corresponding CT or MRI scan was used for labeling in the PLRS. The bootstrap method was applied to the testing dataset.

RESULTS: The model for clinical application, to which images in Digital Imaging and Communications in Medicine (DICOM) format are uploaded directly, is available at http://140.113.114.104/vght_demo/svf-model (grading) and http://140.113.114.104/vght_demo/svf-model2 (labeling). Overall accuracy, sensitivity and specificity in the older adult population were 93.36% (95% CI 93.34%-93.38%), 88.97% (95% CI 88.59%-88.99%) and 94.26% (95% CI 94.23%-94.29%), respectively. Overall accuracy, sensitivity and specificity in the younger adult population were 93.75% (95% CI 93.70%-93.80%), 65.00% (95% CI 64.33%-65.67%) and 98.49% (95% CI 98.45%-98.52%), respectively. Accuracy reached 100% in VF grading once the VFs were labeled accurately. The unique pattern of limbus-like VFs (n=43, 35.8%) was investigated only in the younger adult population. When limbus-like VFs were excluded from the dataset, accuracy increased from 93.75% (95% CI 93.70%-93.80%) to 95.78% (95% CI 95.73%-95.82%), sensitivity increased from 65.00% (95% CI 64.33%-65.67%) to 70.13% (95% CI 68.98%-71.27%), and specificity remained unchanged at 98.49% (95% CI 98.45%-98.52%). The main causes of false negative results in older adults were patients' lung markings, diaphragm or bowel gas (37%, n=14), followed by type I fracture (29%, n=11). The main causes of false negatives in younger adults were limbus-like VFs (45%, n=19), followed by type I fracture (26%, n=11). The overall kappa values between AI discrimination and the subspecialists' consensus in the older and younger adult populations were 0.77 (95% CI, 0.733-0.805) and 0.72 (95% CI, 0.6524-0.80), respectively.

CONCLUSIONS: The developed VF-identifying AI ensemble model, based on the ground truth of older adults, performed better at identifying VFs in older adults and at identifying non-fractured thoracic and lumbar vertebrae in younger adults. Different age distributions may carry different disease patterns, which may explain the effect of ground truth generalizability on AI model performance. © 2021 Elsevier Inc. All rights reserved.
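The outcome measures above can be illustrated with a short calculation. The sketch below is not the authors' code. It rebuilds per-vertebra confusion-matrix counts for the younger adult group from the figures in the abstract (these counts are inferred, and the names `YOUNGER`, `YOUNGER_NO_LIMBUS`, `metrics` and `bootstrap_ci` are hypothetical), then computes accuracy, sensitivity, specificity and Cohen's kappa, plus a simple percentile bootstrap interval. With these inferred counts the point estimates come out at 93.75%, 65.00% and 98.49%, and at 95.78% accuracy and 70.13% sensitivity once limbus-like VFs are excluded. Kappa comes out at about 0.71, close to the reported 0.72.

```python
import random
from collections import Counter

# Counts inferred (not quoted) from the abstract for the younger adult group:
# 120 VFs at 65.00% sensitivity -> 78 detected, 42 missed;
# 728 normal vertebrae at 98.49% specificity -> 717 correct, 11 false positives.
YOUNGER = {"tp": 78, "fn": 42, "fp": 11, "tn": 717}

# Same group without the 43 limbus-like VFs, 19 of which were false negatives
# (so the remaining 24 limbus-like VFs had been detected).
YOUNGER_NO_LIMBUS = {"tp": 78 - (43 - 19), "fn": 42 - 19, "fp": 11, "tn": 717}


def metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity and Cohen's kappa from a 2x2 table."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    # Chance agreement: both raters say "fracture" plus both say "normal".
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "kappa": (accuracy - p_chance) / (1 - p_chance),
    }


def bootstrap_ci(counts, key, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling vertebrae with replacement."""
    rng = random.Random(seed)
    outcomes = [cell for cell, k in counts.items() for _ in range(k)]
    stats = []
    for _ in range(n_boot):
        resampled = Counter(rng.choices(outcomes, k=len(outcomes)))
        table = {cell: resampled.get(cell, 0) for cell in counts}
        stats.append(metrics(**table)[key])
    stats.sort()
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2)) - 1]


if __name__ == "__main__":
    for name, counts in [("younger", YOUNGER), ("no limbus", YOUNGER_NO_LIMBUS)]:
        print(name, {k: round(v, 4) for k, v in metrics(**counts).items()})
    print("accuracy 95% interval:", bootstrap_ci(YOUNGER, "accuracy"))
```

The percentile interval produced this way is considerably wider than the intervals reported in the abstract. The abstract does not say which bootstrap variant was used, so this sketch should not be read as a reproduction of the reported confidence intervals.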

Citation

Pages: 511-523

Page count: 13

25 references in total

[1] [Anonymous], 1994, An introduction to the bootstrap, DOI 10.2307/2983304

[2] Binkley, N; Krueger, D; Gangnon, R; Genant, HK; Drezner, MK. Lateral vertebral assessment: a valuable technique to detect clinically significant vertebral fractures [J]. OSTEOPOROSIS INTERNATIONAL, 2005, 16 (12): 1513-1518

[3] Burns, Joseph E.; Yao, Jianhua; Summers, Ronald M. Vertebral Body Compression Fractures and Bone Density: Automated Detection and Classification on CT Images [J]. RADIOLOGY, 2017, 284 (03): 788-797

[4] Cheng, Chi-Tung; Ho, Tsung-Ying; Lee, Tao-Yi; Chang, Chih-Chen; Chou, Ching-Cheng; Chen, Chih-Chi; Chung, I-Fang; Liao, Chien-Hung. Application of a deep learning algorithm for detection and visualization of hip fractures on plain pelvic radiographs [J]. EUROPEAN RADIOLOGY, 2019, 29 (10): 5469-5477

[5] Chou, P-H.; Ma, H-L.; Liu, C-L.; Wang, S-T.; Lee, O. K.; Chang, M-C.; Yu, W-K. Is removal of the implants needed after fixation of burst fractures of the thoracolumbar and lumbar spine without fusion? A retrospective evaluation of radiological and functional outcomes [J]. BONE & JOINT JOURNAL, 2016, 98B (01): 109-116

[6] Chung, Seok Won; Han, Seung Seog; Lee, Ji Whan; Oh, Kyung-Soo; Kim, Na Ra; Yoon, Jong Pil; Kim, Joon Yub; Moon, Sung Hoon; Kwon, Jieun; Lee, Hyo-Jin; Noh, Young-Min; Kim, Youngjun. Automated detection and classification of the proximal humerus fracture by using deep learning algorithm [J]. ACTA ORTHOPAEDICA, 2018, 89 (04): 468-473

[7] ETTINGER B, 1992, J BONE MINER RES, V7, P449

[8] Farr, Joshua N.; Melton, L. Joseph, III; Achenbach, Sara J.; Atkinson, Elizabeth J.; Khosla, Sundeep; Amin, Shreyasee. Fracture Incidence and Characteristics in Young Adults Aged 18 to 49 Years: A Population-Based Study [J]. JOURNAL OF BONE AND MINERAL RESEARCH, 2017, 32 (12): 2347-2354

[9] Gan, Kaifeng; Xu, Dingli; Lin, Yimu; Shen, Yandong; Zhang, Ting; Hu, Keqi; Zhou, Ke; Bi, Mingguang; Pan, Lingxiao; Wu, Wei; Liu, Yunpeng. Artificial intelligence detection of distal radius fractures: a comparison between the convolutional neural network and professional assessments [J]. ACTA ORTHOPAEDICA, 2019, 90 (04): 394-400

[10] GENANT, HK; WU, CY; VANKUIJK, C; NEVITT, MC. Vertebral fracture assessment using a semiquantitative technique [J]. JOURNAL OF BONE AND MINERAL RESEARCH, 1993, 8 (09): 1137-1148
