Deep Cross-Modal Audio-Visual Generation

Cited by: 299
Authors
Chen, Lele [1]
Srivastava, Sudhanshu [1]
Duan, Zhiyao [2]
Xu, Chenliang [1]
Affiliations
[1] Univ Rochester, Comp Sci, Rochester, NY 14627 USA
[2] Univ Rochester, Elect & Comp Engn, Rochester, NY 14627 USA
Source
PROCEEDINGS OF THE THEMATIC WORKSHOPS OF ACM MULTIMEDIA 2017 (THEMATIC WORKSHOPS'17) | 2017
Keywords
cross-modal generation; audio-visual; generative adversarial networks; perception
DOI
10.1145/3126686.3126723
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Cross-modal audio-visual perception has been a long-lasting topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite work on computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluation demonstrate that our model has the ability to generate one modality, i.e., audio/visual, from the other modality, i.e., visual/audio, to a good extent. Our experiments on various design choices along with the datasets will facilitate future research in this new problem space.
Pages: 349-357
Number of pages: 9