Video Captioning Based on C3D and Visual Elements

Cited by: 1
Authors
Xiao H. [1 ]
Shi J. [1 ]
Affiliations
[1] School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China
Source
Huanan Ligong Daxue Xuebao/Journal of South China University of Technology (Natural Science) | 2018, Vol. 46, No. 8
Funding
National Natural Science Foundation of China
Keywords
Convolutional neural network; Deep learning; Recurrent neural networks; Self-adaptive; Video captioning; Visual elements;
DOI
10.3969/j.issn.1000-565X.2018.08.013
Abstract
With the development of deep learning, the approach of extracting video features with convolutional neural networks (CNNs) and generating sentences with recurrent neural networks (RNNs) has become widely used in video captioning. However, this direct translation ignores much of the intrinsic information in videos, such as temporal information, motion information, and abundant visual-element information. Therefore, a multi-modality video captioning model based on a self-adaptive frame cycle filling algorithm (AFCF-MVC) is proposed in this paper. It uses a self-adaptive feature extraction algorithm to extract C3D features rich in spatio-temporal information from the video and feeds them to the neural network as inputs; at the same time, the self-adaptive feature extraction algorithm exploits the information of the whole video. Because the lengths of the annotated reference sentences differ, a self-adaptive frame cycle filling algorithm is proposed to adaptively control the number of input features according to the length of the annotation sentence, providing as many feature inputs as possible for the neural network while ensuring that the sentence is input completely; it also plays the role of repeated learning. To make use of the rich visual elements of videos, the visual elements of the video frames are detected by a visual detector and encoded into the network as supplementary information. Experimental results on the M-VAD and MPII-MD datasets show that the proposed method not only describes the video content correctly but also mimics the richness of human language. © 2018, Editorial Department, Journal of South China University of Technology. All rights reserved.
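The abstract does not spell out the frame cycle filling step, so the following is only a minimal sketch of one plausible reading: one C3D feature vector per fixed-length clip, with the number of encoder inputs tied to the word count of the annotation sentence. The function name frame_cycle_fill, the 4096-dimensional features, and the even-subsampling branch for long videos are assumptions for illustration, not details from the paper.

```python
import numpy as np

def frame_cycle_fill(clip_features: np.ndarray, caption_length: int) -> np.ndarray:
    """Match the number of C3D clip features to the caption length.

    clip_features: (num_clips, feature_dim) array, one C3D feature per clip.
    caption_length: number of words in the annotation sentence.
    """
    num_clips = clip_features.shape[0]
    if num_clips >= caption_length:
        # Long video: evenly subsample clips so every part of the video
        # is still represented (the "whole video information" property).
        idx = np.linspace(0, num_clips - 1, caption_length).astype(int)
    else:
        # Short video: cycle through the clips until the sentence length
        # is filled, which also yields the "repeated learning" effect.
        idx = np.arange(caption_length) % num_clips
    return clip_features[idx]

# Example: 5 clips of (hypothetically) 4096-dim C3D features, 12-word caption.
feats = np.random.randn(5, 4096).astype(np.float32)
encoder_inputs = frame_cycle_fill(feats, caption_length=12)
print(encoder_inputs.shape)  # (12, 4096)
```

Under this reading, the caption length itself fixes the number of encoder steps, so a short video's clips are cycled through repeatedly, which is consistent with the repeated-learning role the abstract describes.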
Pages: 88-95
Number of pages: 7