[21]
Bain M, Nagrani A, Varol G, et al., Frozen in time: A joint video and image encoder for end-to-end retrieval, Proc. of the IEEE/CVF Int’l Conf. on Computer Vision, pp. 1728-1738, (2021)
[22]
Miech A, Zhukov D, Alayrac JB, et al., HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips, Proc. of the IEEE/CVF Int’l Conf. on Computer Vision, pp. 2630-2640, (2019)
[23]
Miech A, Alayrac JB, Smaira L, et al., End-to-end learning of visual representations from uncurated instructional videos, Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 9879-9889, (2020)
[24]
Wu W, Sun Z, Ouyang W, Revisiting classifier: Transferring vision-language models for video recognition, Proc. of the AAAI Conf. on Artificial Intelligence, 37, 3, pp. 2847-2855, (2023)
[25]
Zhao S, Zhu L, Wang X, et al., CenterCLIP: Token clustering for efficient text-video retrieval, Proc. of the 45th Int’l ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 970-981, (2022)
[26]
Bain M, Nagrani A, Varol G, et al., A CLIP-Hitchhiker’s guide to long video retrieval, (2022)
[27]
Vaswani A, Shazeer N, Parmar N, et al., Attention is all you need, Advances in Neural Information Processing Systems, pp. 5998-6008, (2017)
[28]
Devlin J, Chang MW, Lee K, et al., BERT: Pre-training of deep bidirectional transformers for language understanding, (2018)
[29]
Radford A, Narasimhan K, Salimans T, et al., Improving language understanding by generative pre-training, (2018)
[30]
Radford A, Wu J, Child R, et al., Language models are unsupervised multitask learners