Towards Audio-Visual Saliency Prediction for Omnidirectional Video with Spatial Audio

Cited by: 16
Authors
Chao, Fang-Yi [1 ]
Ozcinar, Cagri [2 ]
Zhang, Lu [1 ]
Hamidouche, Wassim [1 ]
Deforges, Olivier [1 ]
Smolic, Aljosa [2 ]
Affiliations
[1] Univ Rennes, CNRS, INSA Rennes, IETR UMR 6164, F-35000 Rennes, France
[2] Trinity Coll Dublin, Sch Comp Sci & Stat, V SENSE, Dublin, Ireland
Source
2020 IEEE INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP) | 2020
Funding
Science Foundation Ireland;
Keywords
Audio-visual saliency; spatial sound; ambisonics; omnidirectional video (ODV); virtual reality (VR);
DOI
10.1109/vcip49819.2020.9301766
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Omnidirectional videos (ODVs) with spatial audio enable viewers to perceive audio and visual signals from 360 degrees of direction while consuming ODVs with head-mounted displays (HMDs). By predicting salient audio-visual regions, ODV systems can be optimized to deliver high-quality, immersive audio-visual stimuli. Despite intense recent effort on ODV saliency prediction, the current literature still does not consider the impact of auditory information in ODVs. In this work, we propose an audio-visual saliency (AVS360) model that incorporates a 360 degrees spatial-temporal visual representation and spatial auditory information in ODVs. The proposed AVS360 model is composed of two 3D residual networks (ResNets) that encode visual and audio cues. The first is embedded with a spherical representation technique to extract 360 degrees visual features, and the second extracts audio features from the log mel-spectrogram. We emphasize sound source locations by integrating an audio energy map (AEM) generated from the spatial audio description (i.e., ambisonics), and we model equator viewing behavior with an equator center bias (ECB). The audio and visual features are combined and then fused with the AEM and ECB via an attention mechanism. Our experimental results show that the AVS360 model significantly outperforms five state-of-the-art saliency models. To the best of our knowledge, this is the first work to develop an audio-visual saliency model for ODVs. The code will be made publicly available to foster future research on audio-visual saliency in ODVs.
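The two priors named in the abstract, the audio energy map (AEM) decoded from ambisonics and the equator center bias (ECB), can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes first-order ambisonics in ACN channel order with SN3D normalisation, and the function names, grid sizes, and `sigma` width are hypothetical.

```python
import numpy as np

def equator_center_bias(h, w, sigma=0.2):
    """Gaussian prior over latitude for an equirectangular map.

    Models the tendency of ODV viewers to look near the equator;
    sigma (in radians) is a hypothetical width parameter.
    """
    lat = np.linspace(np.pi / 2, -np.pi / 2, h)       # +90 deg (top) .. -90 deg (bottom)
    bias = np.exp(-lat**2 / (2 * sigma**2))
    return np.tile(bias[:, None], (1, w))             # constant along longitude

def audio_energy_map(foa, h=32, w=64):
    """Crude audio energy map from first-order ambisonics.

    For every direction on an equirectangular grid, decode a virtual
    microphone signal and take its mean squared energy. Illustrative
    stand-in for the paper's AEM; assumes foa has shape (4, num_samples)
    in ACN order (W, Y, Z, X).
    """
    W, Y, Z, X = foa                                  # ACN channel order
    lat = np.linspace(np.pi / 2, -np.pi / 2, h)
    lon = np.linspace(-np.pi, np.pi, w)
    aem = np.zeros((h, w))
    for i, th in enumerate(lat):
        for j, ph in enumerate(lon):
            # first-order virtual-mic decode toward latitude th, longitude ph
            s = (W
                 + X * np.cos(th) * np.cos(ph)
                 + Y * np.cos(th) * np.sin(ph)
                 + Z * np.sin(th))
            aem[i, j] = np.mean(s**2)
    return aem / (aem.max() + 1e-8)                   # normalise to [0, 1]
```

As a sanity check, a sound source on the +X axis (directly ahead, on the equator) produces an energy peak near the centre of the equirectangular map, which is where the ECB prior also concentrates mass.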
Pages: 355-358
Page count: 4