共 58 条
[1]
Amrani E(2021)Noise estimation using density estimation for self-supervised multimodal learning Proc AAAI Conf Artif Intell 35 6644-6652
[2]
Ben-Ari R(1994)Learning long-term dependencies with gradient descent is difficult IEEE Trans Neural Netw 5 157-166
[3]
Rotman D(2021)Is space-time attention all you need for video understanding In ICML 2 4-3388
[4]
Bronstein A(2018)Predicting visual features from text for image and video caption retrieval IEEE Trans Multimedia 20 3377-10761
[5]
Bengio Y(2020)Person tube retrieval via language description Proc AAAI Conf Artif Intell 34 10754-22618
[6]
Simard P(2020)Coot: cooperative hierarchical transformer for video-text representation learning Adv Neural Inf Process Syst 33 22605-1780
[7]
Frasconi P(1997)Long short-term memory Neural Comput 9 1735-90
[8]
Bertasius G(2012)Imagenet classification with deep convolutional neural networks Adv Neural Inf Process Syst 25 84-13949
[9]
Wang H(2021)Dynamicvit: efficient vision transformers with dynamic token sparsification Advances in neural information processing systems 34 13937-120
[10]
Torresani L(2017)Movie description Int J Comput Vis 123 94-6010