52 entries in total
[1]
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 1708-1718
[2]
Bao HB, 2022, Arxiv, DOI [arXiv:2111.02358, DOI 10.48550/ARXIV.2111.02358]
[4]
Chen SJ, 2024, Arxiv, DOI [arXiv:2109.09138, DOI 10.1145/3663363]
[5]
UNITER: UNiversal Image-TExt Representation Learning [J]. COMPUTER VISION - ECCV 2020, PT XXX, 2020, 12375: 104-120
[6]
Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
[7]
Dosovitskiy A., 2020, INT C LEARN REPR, P1
[8]
TextDragon: An End-to-End Framework for Arbitrary Shaped Text Spotting [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 9075-9084
[9]
Good News, Everyone! Context Driven Entity-Aware Captioning for News Images [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019: 12458-12467
[10]
Huang ZC, 2020, Arxiv, DOI [arXiv:2004.00849, 10.48550/arXiv.2004.00849]