共 313 条
[1]
Aafaq N(2022)Dense video captioning with early linguistic information fusion IEEE Trans Multimedia 9 70797-70805
[2]
Mian AS(2021)Optimizing spatiotemporal feature learning in 3D convolutional neural networks with pooling blocks IEEE Access 2012 102-112
[3]
Akhtar N(2015)VQA: visual question answering Proc IEEE Int Conf Comput Vis 2017 328-338
[4]
Liu W(2012)Video in sentences out Uncertainty Artif Intell–Proc 28th Conf–UAI 49 2631-2641
[5]
Shah M(2009)Curriculum learning ACM Int Conf Proc Ser 3024 25-36
[6]
Agyeman R(2017)Natural language processing (almost) from scratch Proc IEEE 3rd Int Conf Collaboration Internet Comput CIC 2017 9 1735-1780
[7]
Rafiq M(2019)Describing video with attention-based bidirectional LSTM IEEE Trans Cybern 95 847-862
[8]
Shin HK(2014)High accuracy optical flow estimation based on warping-presentation Lecture Notes Comput Sci (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 33 8191-8198
[9]
Rinner B(1997)TVT: two-view transformer network for video captioning Long Short–Term Memory 1 8421-8431
[10]
Choi GS(2018)Motion guided spatial attention for video captioning Proc Mach Learn Res 2019 6283-6290