36 entries in total
[1]
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018: 6077-6086
[2]
Biten A.F., 2021, LaTr: Layout-Aware Transformer for Scene-Text VQA, P16548
[3]
Scene Text Visual Question Answering [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 4290-4300
[4]
Rosetta: Large Scale System for Text Detection and Recognition in Images [J]. KDD'18: PROCEEDINGS OF THE 24TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2018: 71-79
[6]
UNITER: UNiversal Image-TExt Representation Learning [J]. COMPUTER VISION - ECCV 2020, PT XXX, 2020, 12375: 104-120
[7]
Cho K., 2014, C EMP METH NAT LANG
[8]
Devlin J., 2018, CORR
[9]
Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020: 9989-9999
[10]
Huang Runhui, FILIP FINE GRAINED I