Multimodal Dance Generation Networks Based on Audio-Visual Analysis

Cited by: 1

Authors:

Duan, Lijuan [1]

Xu, Xiao [1]

En, Qing [2]

Affiliations:

[1] Beijing Univ Technol, Beijing, Peoples R China

[2] Beijing Univ Technol, Comp Sci & Technol, Beijing, Peoples R China

Source:

INTERNATIONAL JOURNAL OF MULTIMEDIA DATA ENGINEERING & MANAGEMENT | 2021 / Vol. 12 / No. 01

Keywords:

3D Pose; Audio-Visual; Classification; Dance Generation; LSTM; Metrics; Mixture Density Networks; Multimodal; Skeleton; VAE;

DOI:

10.4018/IJMDEM.2021010102

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Discipline classification codes:

081202; 0835;

Abstract:

3D human dance generation from music is an interesting and challenging task in which the aim is to estimate 3D pose from visual and audio information. Existing methods use only skeleton information to complete this task, which may produce jittery results. In addition, because appropriate evaluation metrics for this task are lacking, it is difficult to evaluate the quality of the generated results. In this paper, the authors explore multi-modality dance generation networks by constructing the correspondence between the visual and the audio cues. Specifically, they propose a 2D prediction module that predicts future frames by fusing visual and audio features. Moreover, they propose a 3D conversion module that generates the 3D skeleton from the 2D skeleton. In addition, they propose new human dance generation evaluation metrics to assess the quality of the generated results. Experimental results indicate that the proposed modules can meet the requirements of authenticity and diversity.


Pages: 17-32

Page count: 16
