47 entries in total
[1]
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts [J]. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021: 3557-3567
[2]
UNITER: UNiversal Image-TExt Representation Learning [J]. Computer Vision - ECCV 2020, Part XXX, 2020, 12375: 104-120
[3]
Chen Z., P IEEE CVF C COMP VI, P10086
[4]
Deng J, 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848
[5]
TransVG: End-to-End Visual Grounding with Transformers [J]. 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021: 1749-1759
[6]
Dosovitskiy A., 2020, INT C LEARN REPR, P1
[7]
Gu X., 2021, arXiv
[8]
LVIS: A Dataset for Large Vocabulary Instance Segmentation [J]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 5351-5359
[9]
Ha H., 2022, C ROB LEARN
[10]
He SA, 2022, arXiv, DOI arXiv:2208.09374