iMakeup: Makeup Instructional Video Dataset for Fine-Grained Dense Video Captioning

Cited by: 0
Authors
Lin X. [1 ]
Jin Q. [1 ]
Chen S. [1 ]
Affiliations
[1] Multimedia Computing Laboratory, School of Information, Renmin University of China, Beijing
Keywords
Large-scale dataset; Makeup; Video caption; Video segmentation
DOI
10.3724/SP.J.1089.2019.17343
Abstract
Automatically describing images or videos with natural-language sentences (a.k.a. image/video captioning) has received increasing attention. Most related work focuses on generating a single caption sentence for an image or a short video. However, most videos in daily life contain numerous actions and objects, and it is hard to describe the complex information in such videos with a single sentence. Learning from long videos has therefore become a compelling problem, yet large-scale datasets for this task remain scarce. Instructional videos are a unique type of video with distinct characteristics that make them attractive for learning, and makeup instructional videos are very popular on commercial video websites. Hence, we present iMakeup, a large-scale makeup instructional video dataset containing 2 000 videos evenly distributed over 50 topics. The total duration of the dataset is about 256 hours, covering about 12 823 video clips segmented according to makeup procedures. We describe the collection and annotation process of the dataset and analyze its scale, text statistics, and diversity in comparison with other video datasets for similar problems. We then report the results of baseline video captioning models on this dataset. The iMakeup dataset contains information from both the visual and auditory modalities with broad coverage and content diversity. Beyond video captioning, it can be used for a wide range of problems, such as video segmentation, object detection, and intelligent fashion recommendation. © 2019, Beijing China Science Journal Publishing Co. Ltd. All rights reserved.
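The scale figures in the abstract (2 000 videos over 50 topics, 256 hours, 12 823 clips) imply a few per-video and per-clip averages. The snippet below is a plain arithmetic sanity check of those figures; the derived averages are computed here for illustration and are not values reported by the authors.

```python
# Headline figures quoted in the iMakeup abstract.
NUM_VIDEOS = 2000
NUM_TOPICS = 50
TOTAL_HOURS = 256
NUM_CLIPS = 12823

# Derived averages (simple arithmetic, not from the paper).
videos_per_topic = NUM_VIDEOS / NUM_TOPICS          # 40 videos per topic
avg_video_minutes = TOTAL_HOURS * 60 / NUM_VIDEOS   # ~7.7 min per video
clips_per_video = NUM_CLIPS / NUM_VIDEOS            # ~6.4 clips per video
avg_clip_seconds = TOTAL_HOURS * 3600 / NUM_CLIPS   # ~72 s per clip

print(f"videos per topic: {videos_per_topic:.1f}")
print(f"avg video length: {avg_video_minutes:.2f} min")
print(f"clips per video:  {clips_per_video:.2f}")
print(f"avg clip length:  {avg_clip_seconds:.1f} s")
```

These back-of-envelope numbers suggest fairly short clips (roughly a minute each), consistent with segmentation by individual makeup steps.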
Pages: 1350-1357
Page count: 7