Exploring the influence of transformer-based multimodal modeling on clinicians' diagnosis of skin diseases: A quantitative analysis

Cited by: 1

Authors:

Zhang, Yujiao [1]

Hu, Yunfeng [1]

Li, Ke [2]

Pan, Xiangjun [1]

Mo, Xiaoling [1]

Zhang, Hong [1]

Affiliations:

[1] Jinan Univ, Affiliated Hosp 1, Dept Dermatol, 613 Huangpu Ave West, Guangzhou 510630, Guangdong, Peoples R China

[2] Wenzhou Med Univ, Sch Clin Med 1, Wenzhou, Zhejiang, Peoples R China

Source:

DIGITAL HEALTH | 2024 / Vol. 10

Keywords:

Skin disease; computer-aided diagnosis; quantitative research; soft voting; multimodality; CLASSIFICATION; CANCER; FEASIBILITY; COMPUTER;

DOI:

10.1177/20552076241257087

Chinese Library Classification number:

R19 [Health care organizations and services (health services administration)];

Abstract:

Objectives: This study aimed to propose a multimodal model that incorporates both macroscopic and microscopic images, and to analyze its influence on the decision-making of clinicians with different levels of experience.
Methods: First, we constructed a multimodal dataset for five skin disorders. Next, we trained unimodal models on three different types of images and selected the best-performing models as the base learners. Then, we used a soft voting strategy to create the multimodal model. Finally, 12 clinicians were divided into three groups, each comprising one director dermatologist, one dermatologist-in-charge, one resident dermatologist, and one general practitioner. They were asked to diagnose the skin disorders in four unaided situations (macroscopic images only; dermatopathological images only; macroscopic and dermatopathological images; all images and metadata) and three aided situations (macroscopic images with the aid of model 1; dermatopathological images with the aid of models 2 and 3; all images with the aid of multimodal model 4). The clinicians' diagnostic accuracy and the time taken for each diagnosis were recorded.
Results: Among the trained models, the vision transformer (ViT) achieved the best performance on the three image types, with accuracies of 0.8636, 0.9545, and 0.9673 and AUCs of 0.9823, 0.9952, and 0.9989 on the training set, respectively. On the external validation set, however, these models achieved accuracies of only 0.70, 0.90, and 0.94, respectively. The multimodal model outperformed the unimodal models, achieving an accuracy of 0.98 on the external validation set. Logistic regression analysis indicated that all models helped clinicians make diagnostic decisions (odds ratio [OR] > 1), whereas metadata did not (OR < 1). Linear analysis indicated that metadata significantly increased clinicians' diagnosis time (P < 0.05), whereas model assistance did not (P > 0.05).
Conclusions: The results of this study suggest that the multimodal model effectively improves clinicians' diagnostic performance without significantly increasing diagnostic time. However, further large-scale prospective studies are necessary.
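
The soft voting step named in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how the base learners were weighted, so equal weights are assumed here, and the function name `soft_vote` and all probability values are hypothetical. Soft voting averages the class probabilities produced by each base learner and predicts the class with the highest averaged probability.

```python
from typing import List, Optional, Sequence, Tuple


def soft_vote(
    prob_sets: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> Tuple[int, List[float]]:
    """Combine per-class probabilities from several base learners.

    prob_sets holds one probability vector per model, all over the same
    classes. Returns (index of the predicted class, averaged probabilities).
    Equal weights are an assumption; the source does not specify them.
    """
    if not prob_sets:
        raise ValueError("at least one probability vector is required")
    n_classes = len(prob_sets[0])
    if any(len(p) != n_classes for p in prob_sets):
        raise ValueError("all probability vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(prob_sets)
    if len(weights) != len(prob_sets):
        raise ValueError("one weight per model is required")

    total = float(sum(weights))
    # Weighted mean of the probabilities, computed class by class.
    avg = [
        sum(w * p[c] for w, p in zip(weights, prob_sets)) / total
        for c in range(n_classes)
    ]
    # The ensemble prediction is the class with the highest mean probability.
    pred = max(range(n_classes), key=avg.__getitem__)
    return pred, avg


# Hypothetical outputs of three unimodal models for one case, five classes.
model_1 = [0.40, 0.35, 0.10, 0.10, 0.05]  # e.g. macroscopic image model
model_2 = [0.20, 0.60, 0.10, 0.05, 0.05]  # e.g. first microscopic image model
model_3 = [0.25, 0.55, 0.10, 0.05, 0.05]  # e.g. second microscopic image model

pred, avg = soft_vote([model_1, model_2, model_3])
print(pred, [round(p, 4) for p in avg])
```

In this made-up case the first model alone would choose class 0, while the averaged probabilities favor class 1, which is how combining modalities can correct a single model's error.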

Pages: 14
