35 entries in total
- [1] Alamri H., 2022, arXiv:2210.14512, DOI 10.48550/arXiv.2210.14512
- [2] Bertasius G., 2021, arXiv:2102.05095, DOI 10.48550/arXiv.2102.05095
- [3] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017: 4724-4733
- [4] Denkowski M., 2014, P 9 WORKSH STAT MACH, P376, DOI 10.3115/V1/W14-3348
- [5] Devlin J, 2019, arXiv:1810.04805
- [6] Semantic Compositional Networks for Visual Captioning [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017: 1141-1150
- [7] Hershey S, 2017, INT CONF ACOUST SPEE, P131, DOI 10.1109/ICASSP.2017.7952132
- [8] Huang L, 2019, arXiv:1908.06954, DOI 10.48550/arXiv.1908.06954
- [9] Multi-modal Dense Video Captioning [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020: 4117-4126
- [10] DenseCap: Fully Convolutional Localization Networks for Dense Captioning [J]. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016: 4565-4574