共 69 条
- [1] [Anonymous], 2022, P IEEE CVF C COMP VI, DOI DOI 10.1109/SPIES55999.2022.10082039
- [2] [Anonymous], 2020, INT C MACH LEARN
- [3] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1708 - 1718
- [4] Bao H., 2021, PROC INT C LEARN REP
- [5] Revisiting the "Video" in Video-Language Understanding [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 2907 - 2917
- [6] Heilbron FC, 2015, PROC CVPR IEEE, P961, DOI 10.1109/CVPR.2015.7298698
- [7] Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 3557 - 3567
- [8] Dosovitskiy A., 2020, PREPRINT
- [9] An Empirical Study of Training End-to-End Vision-and-Language Transformers [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 18145 - 18155
- [10] MDMMT: Multidomain Multimodal Transformer for Video Retrieval [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 3349 - 3358