A Dual-Attention Learning Network With Word and Sentence Embedding for Medical Visual Question Answering

Cited by: 4
Authors
Huang, Xiaofei [1 ]
Gong, Hongfang [1 ]
Affiliations
[1] Changsha Univ Sci & Technol, Sch Math & Stat, Changsha 410114, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Visualization; Medical diagnostic imaging; Data mining; Question answering (information retrieval); Task analysis; Cognition; Medical visual question answering; double embedding; medical information; guided attention; visual reasoning; MODEL;
DOI
10.1109/TMI.2023.3322868
CLC Classification Code
TP39 [Computer Applications];
Subject Classification Codes
081203 ; 0835 ;
Abstract
Research in medical visual question answering (MVQA) can contribute to the development of computer-aided diagnosis. MVQA aims to predict accurate and convincing answers to natural language questions about given medical images. The task requires extracting feature content rich in medical knowledge and developing a fine-grained understanding of it, so an effective feature extraction and understanding scheme is key to modeling. Existing MVQA question extraction schemes focus mainly on word information and ignore medical information in the text, such as medical concepts and domain-specific terms. Meanwhile, some visual and textual feature understanding schemes cannot effectively capture the correlation between regions and keywords that sound visual reasoning requires. In this study, a dual-attention learning network with word and sentence embedding (DALNet-WSE) is proposed. We design a transformer with sentence embedding (TSE) module to extract a double embedding representation of questions that contains both keywords and medical information. A dual-attention learning (DAL) module consisting of self-attention and guided attention is proposed to model intensive intramodal and intermodal interactions. Stacking multiple DAL modules (DALs) lets the network learn visual-textual co-attention, increasing the granularity of understanding and improving visual reasoning. Experimental results on the ImageCLEF 2019 VQA-MED (VQA-MED 2019) and VQA-RAD datasets demonstrate that the proposed method outperforms previous state-of-the-art methods. According to the ablation studies and Grad-CAM maps, DALNet-WSE extracts rich textual information and has strong visual reasoning ability.
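The abstract describes the DAL module as self-attention within each modality followed by guided attention from the question to the image regions. The following is a minimal PyTorch sketch of one such unit, assuming the attention layers follow the standard transformer formulation; the class name DALUnit, the hidden size 512, the 8 heads, and the stack of 4 layers are illustrative assumptions, not the paper's actual settings.

```python
# A minimal sketch of one dual-attention learning (DAL) unit, assuming
# standard transformer-style attention. Dimensions and layer counts are
# illustrative, not the authors' configuration.
import torch
import torch.nn as nn

class DALUnit(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Intramodal interaction: each modality attends to itself.
        self.text_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Intermodal interaction: visual regions attend to (are guided by)
        # the question representation.
        self.guided_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v1 = nn.LayerNorm(dim)
        self.norm_v2 = nn.LayerNorm(dim)

    def forward(self, text, image):
        # text:  (batch, n_tokens, dim)  question features
        # image: (batch, n_regions, dim) visual region features
        t, _ = self.text_self_attn(text, text, text)
        text = self.norm_t(text + t)
        v, _ = self.img_self_attn(image, image, image)
        image = self.norm_v1(image + v)
        # Guided attention: queries come from image regions, keys/values
        # from the question, so regions are weighted by relevant keywords.
        g, _ = self.guided_attn(image, text, text)
        image = self.norm_v2(image + g)
        return text, image

# Stacking several DAL units ("DALs") deepens the co-attention, refining
# region-keyword alignment layer by layer.
dals = nn.ModuleList(DALUnit() for _ in range(4))
text = torch.randn(2, 12, 512)   # toy question features
image = torch.randn(2, 36, 512)  # toy region features
for layer in dals:
    text, image = layer(text, image)
```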
Pages: 832-845
Page count: 14