25 references in total
- [1] Chen H R, Lin K, Maye A, et al., A semantics-assisted video captioning model trained with scheduled sampling, Frontiers in Robotics and AI, 7, (2020)
- [2] Tu Y B, Zhang X S, Liu B T, et al., Video description with spatial-temporal attention, Proceedings of the 25th ACM International Conference on Multimedia, pp. 1014-1022, (2017)
- [3] Zheng Q, Wang C Y, Tao D C., Syntax-aware action targeting for video captioning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 13093-13102, (2020)
- [4] Zhang J C, Peng Y X., Video captioning with object-aware spatio-temporal correlation and aggregation, IEEE Transactions on Image Processing, 29, pp. 6209-6222, (2020)
- [5] Yao L, Torabi A, Cho K, et al., Describing videos by exploiting temporal structure, Proceedings of the IEEE International Conference on Computer Vision, pp. 4507-4515, (2015)
- [6] Zolfaghari M, Singh K, Brox T., ECO: efficient convolutional network for online video understanding, Proceedings of the European Conference on Computer Vision, pp. 713-730, (2018)
- [7] Gan Z, Gan C, He X D, et al., Semantic compositional networks for visual captioning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1141-1150, (2017)
- [8] Xu G H, Niu S C, Tan M K, et al., Towards accurate text-based image captioning with content diversity exploration, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12632-12641, (2021)
- [9] Jiang W H, Zhu M W, Fang Y M, et al., Visual cluster grounding for image captioning, IEEE Transactions on Image Processing, 31, pp. 3920-3934, (2022)
- [10] Venugopalan S, Rohrbach M, Donahue J, et al., Sequence to sequence - video to text, Proceedings of the IEEE International Conference on Computer Vision, pp. 4534-4542, (2015)