CGFTrans: Cross-Modal Global Feature Fusion Transformer for Medical Report Generation

Cited by: 3
Authors
Xu, Liming [1 ,2 ]
Tang, Quan [1 ]
Zheng, Bochuan [1 ]
Lv, Jiancheng [2 ]
Li, Weisheng [3 ]
Zeng, Xianhua [3 ]
Affiliations
[1] China West Normal Univ, Sch Comp Sci, Nanchong 637009, Peoples R China
[2] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
[3] Chongqing Univ Posts & Telecommun, Coll Comp Sci & Technol, Chongqing Key Lab Image Cognit, Chongqing 400065, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Transformers; Medical diagnostic imaging; Pathology; Visualization; Task analysis; Decoding; Medical report generation; transformer; multimodal learning; feature fusion; global feature;
DOI
10.1109/JBHI.2024.3414413
CLC Classification Number
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Medical report generation, a cross-modal automatic text generation task, is highly significant in both research and clinical settings. Its core objective is to generate diagnostic reports in clinical language from medical images. However, existing methods suffer from several limitations, including a lack of global information, weak cross-modal fusion, and high computational demands. To address these issues, we propose the cross-modal global feature fusion Transformer (CGFTrans), which extracts global information while reducing computational cost. First, we introduce a mesh recurrent network that captures inter-layer information at different levels to compensate for the absence of global features. Then, we design a feature fusion decoder with a 'mid-fusion' strategy that separately fuses visual and global features with medical report embeddings, strengthening cross-modal joint learning. Finally, we integrate shifted-window attention into the Transformer encoder to reduce computational load and capture pathological information at multiple scales. Extensive experiments on three datasets show that the proposed method achieves average gains of 2.9%, 1.5%, and 0.7% in BLEU-1, METEOR, and ROUGE-L, respectively, while reducing training time by an average of 22.4% and increasing image throughput by 17.3%.
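This record does not include the authors' implementation, but the 'mid-fusion' idea described in the abstract (fusing visual and global features with report embeddings inside the decoder) can be illustrated with a short sketch. The PyTorch layer below is a minimal, assumption-laden illustration, not the paper's code: the module name MidFusionDecoderLayer, the two parallel cross-attention streams, and the linear merge are all hypothetical choices based only on the abstract.

```python
# Minimal sketch of a "mid-fusion" decoder layer in the spirit of the
# abstract, NOT the authors' released code. All names and the exact
# fusion operator are hypothetical.
import torch
import torch.nn as nn

class MidFusionDecoderLayer(nn.Module):
    """Fuses report-token embeddings with local visual features and
    global features via two separate cross-attention streams."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.merge = nn.Linear(2 * d_model, d_model)  # combine the two streams
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tokens, visual_feats, global_feats, tgt_mask=None):
        # 1) masked self-attention over the partially generated report
        x = self.norm1(tokens + self.self_attn(tokens, tokens, tokens,
                                               attn_mask=tgt_mask)[0])
        # 2) "mid-fusion": attend separately to local visual features and
        #    to global features (e.g., from mesh-connected encoder levels)
        v = self.visual_attn(x, visual_feats, visual_feats)[0]
        g = self.global_attn(x, global_feats, global_feats)[0]
        x = self.norm2(x + self.merge(torch.cat([v, g], dim=-1)))
        # 3) position-wise feed-forward
        return self.norm3(x + self.ffn(x))
```

Concatenating the two attended streams and projecting back to d_model is just one plausible way to merge visual and global context; the paper may use a different fusion operator or ordering.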
Pages: 5600-5612
Page count: 13