Data clustering to select clinically-relevant test cases for algorithm benchmarking and characterization

Cited by: 7

Authors:

Weppler, Sarah [1,2]

Schinkel, Colleen [2,3]

Kirkby, Charles [1,3,4]

Smith, Wendy [1,2,3]

Affiliations:

[1] Univ Calgary, Dept Phys & Astron, Calgary, AB T2N 1N4, Canada

[2] Tom Baker Canc Clin, Dept Med Phys, 1331 29 St NW, Calgary, AB T2N 4N2, Canada

[3] Univ Calgary, Dept Oncol, 2500 Univ Dr NW, Calgary, AB T2N 1N4, Canada

[4] Jack Ady Canc Ctr, Dept Med Phys, 960 19 St S, Lethbridge, AB T1J 1W5, Canada

Source:

PHYSICS IN MEDICINE AND BIOLOGY | 2020, Volume 65, Issue 05

Funding:

Natural Sciences and Engineering Research Council of Canada;

Keywords:

algorithm benchmarking; data clustering; test case selection; DEFORMABLE IMAGE REGISTRATION; RADIATION-THERAPY; RADIOTHERAPY; ACCURACY; PHANTOM; HEAD;

DOI:

10.1088/1361-6560/ab6e54

Chinese Library Classification:

R318 [Biomedical Engineering];

Subject classification code:

0831;

Abstract:

Algorithm benchmarking and characterization are an important part of algorithm development and validation prior to clinical implementation. However, benchmarking may be limited to a small collection of test cases due to the resource-intensive nature of establishing 'ground-truth' references. This study proposes a framework for selecting test cases to assess algorithm and workflow equivalence. Effective test case selection may minimize the number of ground-truth comparisons required to establish robust and clinically relevant benchmarking and characterization results. To demonstrate the proposed framework, we clustered differences between two independent workflows estimating during-treatment dose objective violations for 15 head and neck cancer patients (15 planning CTs, 105 on-unit CBCTs). Each workflow used a different deformable image registration algorithm to estimate inter-fractional anatomy and contour changes. The Hopkins statistic tested whether workflow output was inherently clustered, and k-medoid clustering formalized cluster assignment. Further statistical analyses verified the relevance of clusters to algorithm output. Data at cluster centers ('medoids') were considered as candidate test cases representative of workflow-relevant algorithm differences. The framework indicated that differences in estimated dose objective violations were naturally grouped (Hopkins = 0.75, providing 90% confidence). K-medoid clustering identified five clusters which stratified workflow differences (MANOVA: p < 0.001) in estimated parotid gland D50%, spinal cord/brainstem Dmax, and high-dose CTV coverage dose violations (Kendall's tau: p < 0.05). Systematic algorithm differences resulting in workflow discrepancies were: parotid gland volumes (ANOVA: p < 0.001), external contour deformations (t-test: p = 0.022), and CTV-to-PTV margins (t-test: p = 0.009), respectively. Five candidate test cases were verified as representative of the five clusters.
The framework successfully clustered workflow outputs and identified five test cases representative of clinically relevant algorithm discrepancies. This approach may improve the allocation of resources during the benchmarking and characterization process and the applicability of results to clinical data.
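The two clustering steps named in the abstract (a Hopkins test for clustering tendency, then k-medoid assignment with medoids as candidate test cases) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration rather than the authors' implementation: the function names `hopkins_statistic` and `k_medoids` are hypothetical, the data are synthetic, the Hopkins convention used is the one in which values near 0.5 suggest random data and values near 1 suggest clustering, and the k-medoids routine is a simple alternating (assign, then update) variant that may differ from the algorithm used in the study.

```python
import numpy as np

def hopkins_statistic(X, n_samples=None, seed=0):
    """Hopkins statistic (one common convention): about 0.5 for
    spatially random data, approaching 1 for clustered data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = n_samples or max(1, n // 10)
    # u: distance from each uniform random point (drawn inside the
    # data's bounding box) to its nearest real data point.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = np.linalg.norm(uniform[:, None, :] - X[None, :, :], axis=2).min(axis=1)
    # w: distance from each sampled real point to its nearest *other* real point.
    idx = rng.choice(n, size=m, replace=False)
    dist = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    dist[np.arange(m), idx] = np.inf  # exclude each point's distance to itself
    w = dist.min(axis=1)
    return u.sum() / (u.sum() + w.sum())

def k_medoids(X, k, n_iter=100, seed=0):
    """Simple alternating k-medoids: assign points to the nearest medoid,
    then move each medoid to the member with the smallest total
    within-cluster distance. Returns (medoid indices, labels)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # converged: medoids no longer change
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

# Synthetic stand-in for "workflow difference" features (not study data):
# three well-separated 2-D groups of 30 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

H = hopkins_statistic(X, n_samples=20)
medoids, labels = k_medoids(X, k=3)
print("Hopkins statistic:", round(float(H), 3))
print("Medoid rows (candidate test cases):", medoids)
```

In this sketch each medoid is an actual row of the data, which is what makes medoids (unlike k-means centroids) usable directly as representative test cases. The alternating update can settle in a local optimum, so in practice several random initializations would normally be compared.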


Pages: 12
