共 37 条
[1]
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
[J].
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR),
2018,
:3674-3683
[2]
[Anonymous], 2018, ANAL MACHINE INTELLI, DOI DOI 10.1109/TPAMI.2017.2754246
[3]
[Anonymous], 2019, ICLR, DOI DOI 10.1159/000501710
[4]
VQA: Visual Question Answering
[J].
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV),
2015,
:2425-2433
[5]
MUTAN: Multimodal Tucker Fusion for Visual Question Answering
[J].
2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV),
2017,
:2631-2639
[6]
Brown TB, 2020, ADV NEUR IN, V33
[7]
Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
[8]
Dosovitskiy A., 2020, INT C LEARN REPR
[9]
Gardères F, 2020, FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, P489
[10]
Focal and Composed Vision-semantic Modeling for Visual Question Answering
[J].
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021,
2021,
:4528-4536