56 entries in total
- [2] Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6077 - 6086
- [3] VQA: Visual Question Answering [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 2425 - 2433
- [4] MUREL: Multimodal Relational Reasoning for Visual Question Answering [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 1989 - 1998
- [5] Carion N., 2020, EUR C COMP VIS, DOI 10.1007/978-3-030-58452-8_13
- [7] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12114 - 12124
- [8] Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6087 - 6096
- [9] Stacked Latent Attention for Multimodal Reasoning [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 1072 - 1080
- [10] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12053 - 12062