Cross-Modal Augmented Transformer for Automated Medical Report Generation

Cited: 0
Authors
Tang, Yuhao [1 ]
Yuan, Ye [2 ]
Tao, Fei [3 ]
Tang, Minghao [4 ]
Affiliations
[1] Jiangsu Police Inst, Nanjing 210031, Peoples R China
[2] Ind & Commercial Bank China, Jiangsu Prov Branch, Nanjing 210006, Peoples R China
[3] Yangzhou Intermediate Peoples Court Jiangsu Prov, Yangzhou 225009, Peoples R China
[4] First Peoples Hosp Jiashan, Jiaxing 314100, Peoples R China
Source
IEEE JOURNAL OF TRANSLATIONAL ENGINEERING IN HEALTH AND MEDICINE | 2025, Vol. 13
Funding
U.S. National Science Foundation;
Keywords
Medical report generation; medical imaging; automatic diagnosis; clinical automation; image captioning;
DOI
10.1109/JTEHM.2025.3536441
CLC Classification Number
R318 [Biomedical Engineering];
Discipline Code
0831;
Abstract
In clinical practice, interpreting medical images and composing diagnostic reports typically involves a significant manual workload. An automated report generation framework that mimics a doctor's diagnostic process therefore better meets the requirements of medical scenarios. Prior investigations often overlook this critical aspect, relying primarily on traditional image captioning frameworks originally designed for general-domain images and sentences. Despite some advances, these methodologies face two primary challenges. First, strong noise in blurred medical images often hinders the model from capturing the lesion region. Second, when writing reports, doctors typically rely on terminology for diagnosis, a crucial aspect neglected in prior frameworks. In this paper, we present a novel approach called Cross-modal Augmented Transformer (CAT) for medical report generation. Unlike previous methods that rely on coarse-grained features without human intervention, our method introduces a "locate then generate" pattern, thereby improving the interpretability of the generated reports. During the locate stage, CAT captures crucial representations by pre-aligning significant patches with their corresponding medical terminologies. This pre-alignment reduces visual noise by discarding low-ranking content, ensuring that only relevant information is considered in the report generation process. During the generation phase, CAT utilizes a multi-modality encoder to reinforce the correlation between generated keywords, retrieved terminologies, and regions. Furthermore, CAT employs a dual-stream decoder that dynamically determines whether the predicted word should be influenced by the retrieved terminology or the preceding sentence.
Experimental results demonstrate the effectiveness of the proposed method on two datasets. Clinical Impact: This work aims to design an automated framework for explaining medical images to evaluate the health status of individuals, thereby facilitating their broader application in clinical settings. Clinical and Translational Impact Statement: In our preclinical research, we develop an automated system for generating diagnostic reports. This system mimics manual diagnostic methods by combining fine-grained semantic alignment with dual-stream decoders.
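The dual-stream decoding idea described in the abstract can be illustrated with a minimal sketch. The scalar-gate formulation below is an assumption made for illustration only (the names `dual_stream_step`, `c_term`, and `c_sent`, and the gating equation itself, are not taken from the paper): at each step, a learned gate decides how strongly the predicted word is influenced by the terminology stream versus the preceding-sentence stream.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_stream_step(h, c_term, c_sent, W, b):
    """One decoding step that mixes two context streams (illustrative).

    h      : current decoder hidden state, shape (d,)
    c_term : context attended over retrieved terminologies, shape (d,)
    c_sent : context attended over the preceding sentence, shape (d,)
    W, b   : parameters of a scalar gate (W has shape (3*d,), b is a scalar)

    The gate g in (0, 1) interpolates between the two streams, so the
    output context leans toward terminology when g is near 1 and toward
    the preceding sentence when g is near 0.
    """
    g = sigmoid(W @ np.concatenate([h, c_term, c_sent]) + b)
    return g * c_term + (1.0 - g) * c_sent, g

# Toy usage with random vectors standing in for learned representations.
rng = np.random.default_rng(0)
d = 8
h, c_term, c_sent = rng.normal(size=(3, d))
W, b = rng.normal(size=3 * d), 0.0
ctx, g = dual_stream_step(h, c_term, c_sent, W, b)
print(ctx.shape)
```

In a trained model, `ctx` would feed the word-prediction layer; here it merely shows the interpolation mechanic.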
Pages: 33-48
Page count: 16