A Dual-Attention Learning Network With Word and Sentence Embedding for Medical Visual Question Answering

Cited by: 4
Authors
Huang, Xiaofei [1 ]
Gong, Hongfang [1 ]
Affiliations
[1] Changsha Univ Sci & Technol, Sch Math & Stat, Changsha 410114, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Visualization; Medical diagnostic imaging; Data mining; Question answering (information retrieval); Task analysis; Cognition; Medical visual question answering; double embedding; medical information; guided attention; visual reasoning; MODEL;
DOI
10.1109/TMI.2023.3322868
CLC Classification Code
TP39 [Computer Applications];
Subject Classification Codes
081203 ; 0835 ;
Abstract
Research in medical visual question answering (MVQA) can contribute to the development of computer-aided diagnosis. MVQA aims to predict accurate and convincing answers to natural language questions about given medical images. The task requires extracting feature content rich in medical knowledge and developing a fine-grained understanding of it, so an effective feature extraction and understanding scheme is key to modeling. Existing MVQA question extraction schemes focus mainly on word information and ignore medical information in the text, such as medical concepts and domain-specific terms. Meanwhile, some visual and textual feature understanding schemes cannot effectively capture the correlation between regions and keywords that sound visual reasoning requires. In this study, a dual-attention learning network with word and sentence embedding (DALNet-WSE) is proposed. We design a transformer with sentence embedding (TSE) module to extract a double embedding representation of questions that contains both keywords and medical information. A dual-attention learning (DAL) module consisting of self-attention and guided attention is proposed to model intensive intramodal and intermodal interactions. Stacking multiple DAL modules (DALs) lets the network learn visual-textual co-attention, increasing the granularity of understanding and improving visual reasoning. Experimental results on the ImageCLEF 2019 VQA-MED (VQA-MED 2019) and VQA-RAD datasets demonstrate that the proposed method outperforms previous state-of-the-art methods. According to the ablation studies and Grad-CAM maps, DALNet-WSE extracts rich textual information and has strong visual reasoning ability.
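The abstract describes the DAL module as self-attention within each modality followed by guided attention from the question to the image regions. The following is a minimal PyTorch sketch of one such unit, assuming the attention layers follow the standard transformer formulation; the class name DALUnit, the hidden size 512, the 8 heads, and the stack of 4 layers are illustrative assumptions, not the paper's actual settings.

```python
# A minimal sketch of one dual-attention learning (DAL) unit, assuming
# standard transformer-style attention. Dimensions and layer counts are
# illustrative, not the authors' configuration.
import torch
import torch.nn as nn

class DALUnit(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Intramodal interaction: each modality attends to itself.
        self.text_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Intermodal interaction: visual regions attend to (are guided by)
        # the question representation.
        self.guided_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v1 = nn.LayerNorm(dim)
        self.norm_v2 = nn.LayerNorm(dim)

    def forward(self, text, image):
        # text:  (batch, n_tokens, dim)  question features
        # image: (batch, n_regions, dim) visual region features
        t, _ = self.text_self_attn(text, text, text)
        text = self.norm_t(text + t)
        v, _ = self.img_self_attn(image, image, image)
        image = self.norm_v1(image + v)
        # Guided attention: queries come from image regions, keys/values
        # from the question, so regions are weighted by relevant keywords.
        g, _ = self.guided_attn(image, text, text)
        image = self.norm_v2(image + g)
        return text, image

# Stacking several DAL units ("DALs") deepens the co-attention, refining
# region-keyword alignment layer by layer.
dals = nn.ModuleList(DALUnit() for _ in range(4))
text = torch.randn(2, 12, 512)   # toy question features
image = torch.randn(2, 36, 512)  # toy region features
for layer in dals:
    text, image = layer(text, image)
```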
Pages: 832-845
Page count: 14