26 entries in total
[1] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 1708-1718
[2] Cheng Xing, 2021, arXiv
[3] TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 11563-11573
[4] Dosovitskiy A, 2021, arXiv, arXiv:2010.11929
[5] MDMMT: Multidomain Multimodal Transformer for Video Retrieval [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021: 3349-3358
[6] Multi-modal Transformer for Video Retrieval [J]. COMPUTER VISION - ECCV 2020, PT IV, 2020, 12349: 214-229
[7] X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 4996-5005
[8] Goyal P, 2018, arXiv, arXiv:1706.02677, DOI 10.48550/arXiv.1706.02677
[9] Kingma DP, 2014, ADV NEUR IN, V27
[10] Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021: 7327-7337