DEEP VIDEO INPAINTING GUIDED BY AUDIO-VISUAL SELF-SUPERVISION

Cited by: 0
Authors
Kim, Kyuyeon [1]
Jung, Junsik [1]
Kim, Woo Jae [1]
Yoon, Sung-Eui [1]
Affiliations
[1] Korea Adv Inst Sci & Technol KAIST, Sch Comp, Daejeon, South Korea
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Funding
National Research Foundation, Singapore;
Keywords
audio-visual learning; audio-visual correspondence; audio-visual network; deep video inpainting;
DOI
10.1109/ICASSP43922.2022.9747073
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Humans can easily imagine a scene from auditory information based on their prior knowledge of audio-visual events. In this paper, we mimic this innate human ability in deep learning models to improve the quality of video inpainting. To implement the prior knowledge, we first train an audio-visual network, which learns the correspondence between auditory and visual information. Then, the audio-visual network is employed as a guide that conveys the prior knowledge of audio-visual correspondence to the video inpainting network. This prior knowledge is transferred through two novel losses that we propose: an audio-visual attention loss and an audio-visual pseudo-class consistency loss. These two losses further improve video inpainting performance by encouraging the inpainting result to correspond closely to its synchronized audio. Experimental results demonstrate that our proposed method can restore a wider domain of video scenes and is particularly effective when the sounding object in the scene is partially hidden.
Pages: 1970-1974
Page count: 5