82 entries in total
[1]
[Anonymous], 2011, P 49 ANN M ASS COMPU
[2]
ViViT: A Video Vision Transformer [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 6816-6826
[3]
MultiMAE: Multi-modal Multi-task Masked Autoencoders [J]. COMPUTER VISION, ECCV 2022, PT XXXVII, 2022, 13697: 348-367
[4]
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 1708-1718
[5]
Bao H., 2022, P INT C LEARN REPR
[6]
Bao HB, 2022, Arxiv, DOI arXiv:2206.01127
[7]
Bertasius G, 2021, PR MACH LEARN RES, V139
[8]
Heilbron FC, 2015, PROC CVPR IEEE, P961, DOI 10.1109/CVPR.2015.7298698
[9]
Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020: 10635-10644
[10]
UNITER: UNiversal Image-TExt Representation Learning [J]. COMPUTER VISION - ECCV 2020, PT XXX, 2020, 12375: 104-120