Multimodal Dance Generation Networks Based on Audio-Visual Analysis

Cited by: 1
Authors
Duan, Lijuan [1]
Xu, Xiao [1]
En, Qing [2]
Affiliations
[1] Beijing Univ Technol, Beijing, Peoples R China
[2] Beijing Univ Technol, Comp Sci & Technol, Beijing, Peoples R China
Source
INTERNATIONAL JOURNAL OF MULTIMEDIA DATA ENGINEERING & MANAGEMENT | 2021, Vol. 12, No. 1
Keywords
3D Pose; Audio-Visual; Classification; Dance Generation; LSTM; Metrics; Mixture Density Networks; Multimodal; Skeleton; VAE;
DOI
10.4018/IJMDEM.2021010102
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
3D human dance generation from music is an interesting and challenging task in which the aim is to estimate 3D pose from visual and audio information. Existing methods use only skeleton information to complete this task, which may cause jittery results. In addition, due to the lack of appropriate evaluation metrics for this task, it is difficult to evaluate the quality of the generated results. In this paper, the authors explore multi-modality dance generation networks by constructing the correspondence between the visual and the audio cues. Specifically, they propose a 2D prediction module that predicts future frames by fusing visual and audio features. Moreover, they propose a 3D conversion module, which generates the 3D skeleton from the 2D skeleton. In addition, new human dance generation evaluation metrics are proposed to evaluate the quality of the generated results. Experimental results indicate that the proposed modules can meet the requirements of authenticity and diversity.
Pages: 17-32
Number of pages: 16