Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting

Cited by: 3
Authors
Fu, Fengyi [1 ]
Fang, Shancheng [1 ]
Chen, Weidong [1 ]
Mao, Zhendong [1 ]
Affiliations
[1] Univ Sci & Technol China, 100 Fuxing Rd, Hefei 230000, Anhui, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Automatic live video commenting; multi-modal learning; variational autoencoder; batch attention mechanism;
DOI
10.1145/3633334
CLC number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Automatic live video commenting is attracting increasing attention due to its significance in narration generation, topic explanation, etc. However, current methods do not account for the diverse sentiments of the generated comments. Sentimental factors are critical in interactive commenting, yet they have received little research attention so far. Thus, in this article, we propose a Sentiment-oriented Transformer-based Variational Autoencoder (So-TVAE) network, which consists of a sentiment-oriented diversity encoder module and a batch attention module, to achieve diverse video commenting with multiple sentiments and multiple semantics. Specifically, our sentiment-oriented diversity encoder elegantly combines a VAE and a random mask mechanism to achieve semantic diversity under sentiment guidance, which is then fused with cross-modal features to generate live video comments. A batch attention module is also proposed in this article to alleviate the problem of missing sentimental samples, caused by the data imbalance that is common in live videos as the popularity of videos varies. Extensive experiments on the Livebot and VideoIC datasets demonstrate that the proposed So-TVAE outperforms state-of-the-art methods in terms of both the quality and diversity of generated comments. Related code is available at https://github.com/fufy1024/So-TVAE.
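The two building blocks the abstract names, a VAE latent sample and a random mask over input tokens, can be illustrated with a minimal NumPy sketch. This is not the So-TVAE implementation (see the linked repository for that); the function names `random_mask` and `reparameterize`, the mask ratio, and the placeholder sentiment vector are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(tokens, mask_ratio=0.3, mask_id=0):
    """Randomly replace a fraction of token ids with a mask id,
    a common way to inject input-side diversity."""
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < mask_ratio
    return np.where(mask, mask_id, tokens)

def reparameterize(mu, log_var):
    """VAE reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Mask part of a (toy) comment token sequence before encoding.
masked_tokens = random_mask([11, 12, 13, 14, 15], mask_ratio=0.4)

# Sample a latent code and condition the decoder input on a
# placeholder sentiment-embedding vector (hypothetical shapes).
mu, log_var = np.zeros(8), np.zeros(8)
z = reparameterize(mu, log_var)
sentiment_emb = np.ones(4)
decoder_input = np.concatenate([z, sentiment_emb])
print(decoder_input.shape)  # prints (12,)
```

Sampling `z` stochastically rather than using `mu` directly is what lets a VAE-based generator emit multiple distinct comments for the same video context.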
Pages: 24