Automatic Video Captioning via Multi-channel Sequential Encoding

Cited by: 2
Authors
Zhang, Chenyang [1]
Tian, Yingli [1]
Affiliations
[1] CUNY City Coll, Dept Elect Engn, New York, NY 10031 USA
Source
COMPUTER VISION - ECCV 2016 WORKSHOPS, PT II | 2016 / Vol. 9914
Keywords
Video captioning; Long-short-term-memory; Sequential encoding; American Sign Language;
DOI
10.1007/978-3-319-48881-3_11
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we propose a novel two-stage video captioning framework composed of (1) a multi-channel video encoder and (2) a sentence-generating language decoder. Both the encoder and the decoder are based on recurrent neural networks with long short-term memory (LSTM) cells. Our system can take videos of arbitrary length as input. Compared with previous sequence-to-sequence video captioning frameworks, the proposed model is able to handle multiple channels of video representations and jointly learn how to combine them. The proposed model is evaluated on two large-scale movie datasets (MPII Corpus and Montreal Video Description) and one YouTube dataset (Microsoft Video Description Corpus), and achieves state-of-the-art performance. Furthermore, we extend the proposed model towards automatic American Sign Language (ASL) recognition. To evaluate the performance of our model on this novel application, a new dataset for ASL video description is collected from YouTube videos. Results on this dataset indicate that the proposed framework is promising for ASL recognition and will significantly benefit independent communication between ASL users and others.
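The encoder idea in the abstract (several feature channels, each read frame by frame, then combined) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the abstract does not say how channels are combined, so the softmax-weighted sum in `fuse`, the random untrained weights, and every name here (`make_lstm`, `encode_channel`, `encode_video`, `channel_logits`) are assumptions made for illustration. The language decoder is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_lstm(input_size, hidden_size, rng):
    # Hypothetical helper: random (untrained) weights for one LSTM cell.
    # Each row multiplies the concatenation [input, hidden state, bias].
    width = input_size + hidden_size + 1
    return [[rng.uniform(-0.1, 0.1) for _ in range(width)]
            for _ in range(4 * hidden_size)]


def lstm_step(weights, x, h, c):
    # Standard LSTM update: input, forget, output gates and candidate cell.
    n = len(h)
    z = x + h + [1.0]
    pre = [sum(w * v for w, v in zip(row, z)) for row in weights]
    i = [sigmoid(p) for p in pre[0:n]]
    f = [sigmoid(p) for p in pre[n:2 * n]]
    o = [sigmoid(p) for p in pre[2 * n:3 * n]]
    g = [math.tanh(p) for p in pre[3 * n:4 * n]]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def encode_channel(weights, frames, hidden_size):
    # Reads any number of frames and returns the final hidden state.
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in frames:
        h, c = lstm_step(weights, x, h, c)
    return h


def fuse(channel_states, channel_logits):
    # ASSUMPTION: channels are combined by a softmax-weighted sum whose
    # logits would be learned jointly with the rest of the model.
    mix = softmax(channel_logits)
    size = len(channel_states[0])
    return [sum(w * s[k] for w, s in zip(mix, channel_states))
            for k in range(size)]


def encode_video(channels, encoders, channel_logits, hidden_size):
    states = [encode_channel(enc, frames, hidden_size)
              for enc, frames in zip(encoders, channels)]
    return fuse(states, channel_logits)


rng = random.Random(0)
HIDDEN = 4
# Two made-up channels with different feature sizes, five frames each.
channels = [
    [[rng.random() for _ in range(3)] for _ in range(5)],
    [[rng.random() for _ in range(2)] for _ in range(5)],
]
encoders = [make_lstm(3, HIDDEN, rng), make_lstm(2, HIDDEN, rng)]
video_code = encode_video(channels, encoders, [0.0, 0.0], HIDDEN)
print(video_code)
```

Because each channel is consumed one frame at a time, the same functions accept videos of any length, and the fused vector always has `HIDDEN` entries. In a full captioning model, a vector of this kind would initialize or condition the sentence decoder.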
Pages: 146-161
Page count: 16