Vision Transformer and Language Model Based Radiology Report Generation

Cited by: 16

Authors:

Mohsan, Mashood Mohammad ^{[1]}

Akram, Muhammad Usman ^{[1]}

Rasool, Ghulam ^{[2]}

Alghamdi, Norah Saleh ^{[3]}

Baqai, Muhammad Abdullah Aamer ^{[4]}

Abbas, Muhammad ^{[1]}

Affiliations:

[1] Natl Univ Sci & Technol, Dept Comp & Software Engn, Islamabad 44000, Pakistan

[2] H Lee Moffitt Canc Ctr & Res Inst, Machine Learning Dept, Tampa, FL 33612 USA

[3] Princess Nourah Bint Abdulrahman Univ, Coll Comp & Informat Sci, Dept Comp Sci, Riyadh 11671, Saudi Arabia

[4] Michigan State Univ, Coll Engn, E Lansing, MI 48824 USA

Source:

IEEE Access | 2023 / Vol. 11

Keywords:

Vision transformers; language models; radiology report; decoder

DOI:

10.1109/ACCESS.2022.3232719

Chinese Library Classification:

TP [Automation technology, computer technology]

Subject classification code:

0812

Abstract:

Recent advances in transformers have been applied to computer vision problems, resulting in state-of-the-art models. Transformer-based models have shown remarkable performance in various sequence prediction tasks such as language translation, sentiment classification, and caption generation. Automatic report generation in medical imaging through caption generation models is one applied scenario for language models and has a strong social impact. In these models, convolutional neural networks have been used as the encoder to capture spatial information, and recurrent neural networks as the decoder to generate the caption or medical report. However, using a transformer architecture as both encoder and decoder in caption or report writing tasks is still unexplored. In this research, we explore the effect of losing spatial bias information in the encoder by using a pre-trained vanilla image transformer architecture and combining it with different pre-trained language transformers as the decoder. To evaluate the proposed methodology, the Indiana University Chest X-Rays dataset is used, and an ablation study is conducted with respect to different evaluations. The comparative analysis shows that the proposed methodology achieves remarkable performance compared with existing techniques in terms of different performance parameters.
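
The abstract gives no implementation details, so the following is only a minimal sketch of one idea it mentions: a vanilla image transformer encoder receives the image as a flat sequence of patches, so it lacks the built-in spatial bias of a convolutional encoder. The function name `image_to_patch_sequence`, the toy 4x4 image, and the patch size are hypothetical and not taken from the paper.

```python
from typing import List

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def image_to_patch_sequence(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping square patches.

    Each patch is flattened in row-major order, and the patches themselves
    are emitted left to right, top to bottom. A transformer encoder would
    consume this list as a token sequence (hypothetical illustration only).
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    height = len(image)
    width = len(image[0]) if height else 0
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches: List[List[float]] = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch; the 2-D neighbourhood structure is lost here.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 "image" whose pixel value encodes its position (assumed data).
toy_image = [[float(row * 4 + col) for col in range(4)] for row in range(4)]
patch_sequence = image_to_patch_sequence(toy_image, patch_size=2)
```

In this sketch the encoder input is just an ordered list of four flattened patches; any notion of which patches are neighbours would have to come from position embeddings and learned attention rather than from the architecture itself.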

Citation

Pages: 1814-1824

Page count: 11

36 references in total
