Progressive Spatial-temporal Collaborative Network for Video Frame Interpolation

Cited by: 14
Authors
Hu, Mengshun [1 ]
Jiang, Kui [1 ]
Liao, Liang [2 ]
Nie, Zhixiang [1 ]
Xiao, Jing [1 ]
Wang, Zheng [1 ]
Affiliations
[1] Wuhan Univ, Natl Engn Res Ctr Multimedia Software, Sch Comp Sci, Hubei Key Lab Multimedia & Network Commun Engn, Wuhan, Peoples R China
[2] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Natural Science Foundation of China
Keywords
Video frame interpolation; Collaborative network; Content-guided motion; Motion-guided content; Multi-scale;
DOI
10.1145/3503161.3547875
Chinese Library Classification
TP39 [Computer Applications]
Subject Classification Codes
081203; 0835
Abstract
Most video frame interpolation (VFI) algorithms infer the intermediate frame from adjacent frames through cascaded motion estimation and content refinement. However, the intrinsic correlations between motion and content are barely investigated, commonly producing interpolated results with inconsistent and blurry contents. We first identify a simple yet essential piece of domain knowledge: the content and motion characteristics of the same objects should be homogeneous to a certain degree, and we formulate this consistency into a loss function for model optimization. Based on this, we propose to learn collaborative representations between motions and contents, and construct a novel Progressive Spatial-temporal Collaborative Network (Prost-Net) for video frame interpolation. Specifically, we develop a content-guided motion module (CGMM) and a motion-guided content module (MGCM) for individual content and motion representation. In particular, the motion predicted by the CGMM is used to guide the fusion and distillation of contents for intermediate frame interpolation, and vice versa. Furthermore, by embedding this collaborative strategy in a multi-scale framework, Prost-Net progressively optimizes motions and contents in a coarse-to-fine manner, making it robust to challenging scenarios in VFI (e.g., occlusion and large motions). Extensive experiments on benchmark datasets demonstrate that our method significantly outperforms state-of-the-art methods.
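The record contains no code, so the following is a minimal, hypothetical PyTorch sketch of the collaborative idea the abstract describes: a CGMM whose flow refinement is conditioned on content features, an MGCM that warps and fuses content under the predicted flow, and a coarse-to-fine loop that alternates the two across pyramid levels. All module internals, channel sizes, and the backward-warping fusion are illustrative assumptions, not the authors' implementation; the consistency loss mentioned in the abstract is not sketched here.

```python
# Hypothetical sketch (not the authors' code) of the collaborative
# motion/content scheme described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.LeakyReLU(0.1, inplace=True))

def backward_warp(feat, flow):
    # Standard backward warping of `feat` by a 2-channel flow field.
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device),
                            indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()        # (2, H, W), x first
    coords = base.unsqueeze(0) + flow                  # (B, 2, H, W)
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0      # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=3)                # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

class CGMM(nn.Module):
    # Content-guided motion module (illustrative): content features of both
    # frames guide a residual refinement of the bidirectional flow.
    def __init__(self, feat_ch=32):
        super().__init__()
        self.net = nn.Sequential(conv(2 * feat_ch + 4, 64), conv(64, 64),
                                 nn.Conv2d(64, 4, 3, padding=1))

    def forward(self, flow, feat0, feat1):
        return flow + self.net(torch.cat([flow, feat0, feat1], dim=1))

class MGCM(nn.Module):
    # Motion-guided content module (illustrative): the predicted flows warp
    # the frame features, which are then fused into intermediate content.
    def __init__(self, feat_ch=32):
        super().__init__()
        self.fuse = nn.Sequential(conv(2 * feat_ch, 64), conv(64, feat_ch))

    def forward(self, flow, feat0, feat1):
        w0 = backward_warp(feat0, flow[:, :2])
        w1 = backward_warp(feat1, flow[:, 2:])
        return self.fuse(torch.cat([w0, w1], dim=1))

class ProstNetSketch(nn.Module):
    # Coarse-to-fine loop: CGMM and MGCM alternate at each pyramid level,
    # so motion and content guide each other progressively.
    # Assumes input height/width divisible by 2**(levels - 1).
    def __init__(self, feat_ch=32, levels=3):
        super().__init__()
        self.encode = conv(3, feat_ch)
        self.cgmm = nn.ModuleList(CGMM(feat_ch) for _ in range(levels))
        self.mgcm = nn.ModuleList(MGCM(feat_ch) for _ in range(levels))
        self.to_rgb = nn.Conv2d(feat_ch, 3, 3, padding=1)
        self.levels = levels

    def forward(self, frame0, frame1):
        f0, f1 = self.encode(frame0), self.encode(frame1)
        # Feature pyramids ordered coarsest -> finest.
        pyr0 = [F.avg_pool2d(f0, 2 ** i) for i in reversed(range(self.levels))]
        pyr1 = [F.avg_pool2d(f1, 2 ** i) for i in reversed(range(self.levels))]
        b, _, h, w = pyr0[0].shape
        flow = torch.zeros(b, 4, h, w, device=frame0.device)
        content = None
        for i in range(self.levels):
            flow = self.cgmm[i](flow, pyr0[i], pyr1[i])     # content -> motion
            content = self.mgcm[i](flow, pyr0[i], pyr1[i])  # motion -> content
            if i < self.levels - 1:
                # Upsample flow to the next finer level and rescale magnitudes.
                flow = 2.0 * F.interpolate(flow, scale_factor=2,
                                           mode="bilinear", align_corners=False)
        return self.to_rgb(content)
```

As a quick shape check, ProstNetSketch()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)) returns a (1, 3, 64, 64) prediction; the residual flow refinement and the per-level alternation are design choices made here only to make the coarse-to-fine collaboration concrete.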
Pages: 2145-2153
Page count: 9