Study of Spatio-Temporal Modeling in Video Quality Assessment

Cited by: 8
Authors
Fang, Yuming [1 ]
Li, Zhaoqian [1 ]
Yan, Jiebin [1 ]
Sui, Xiangjie [1 ]
Liu, Hantao [2 ]
Affiliations
[1] Jiangxi Univ Finance & Econ, Sch Informat Technol, Nanchang 330032, Jiangxi, Peoples R China
[2] Cardiff Univ, Sch Comp Sci & Informat, Cardiff CF24 3AA, Wales
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Video quality assessment; spatio-temporal modeling; recurrent neural network; PREDICTION; DATABASE; FLOW;
DOI
10.1109/TIP.2023.3272480
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Video quality assessment (VQA) has received remarkable attention recently. Most popular VQA models employ recurrent neural networks (RNNs) to capture the temporal quality variation of videos. However, each long-term video sequence is commonly labeled with a single quality score, from which RNNs might not learn long-term quality variation well. What is the real role of RNNs in learning the visual quality of videos? Do they learn spatio-temporal representations as expected, or merely aggregate spatial features redundantly? In this study, we conduct a comprehensive investigation by training a family of VQA models with carefully designed frame sampling strategies and spatio-temporal fusion methods. Our extensive experiments on four publicly available in-the-wild video quality datasets lead to two main findings. First, the plausible spatio-temporal modeling module (i.e., RNNs) does not facilitate quality-aware spatio-temporal feature learning. Second, sparsely sampled video frames achieve performance competitive with using all video frames as input. In other words, spatial features play a vital role in capturing video quality variation for VQA. To the best of our knowledge, this is the first work to explore the issue of spatio-temporal modeling in VQA.
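The abstract's second finding suggests a simple baseline pipeline: uniformly sample a few frames, extract per-frame spatial features, and pool them over time, with no recurrent module. The following is a minimal sketch of that idea, not the paper's exact pipeline; the sampling function, the placeholder "features," and the mean-pooling head are illustrative assumptions (a real model would use a CNN backbone and a learned regression head).

```python
import numpy as np

def sample_frames(num_frames: int, num_samples: int) -> np.ndarray:
    """Uniformly pick `num_samples` frame indices from a clip of `num_frames` frames."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def predict_quality(frame_features: np.ndarray) -> float:
    """Aggregate per-frame spatial features by temporal mean pooling.

    `frame_features` has shape (T, D). The per-frame score here is just the
    feature mean, standing in for a learned quality-regression head.
    """
    per_frame_scores = frame_features.mean(axis=1)  # stand-in for a learned head
    return float(per_frame_scores.mean())           # temporal average pooling

# Example: sample 8 of 240 frames, then pool dummy per-frame "features".
idx = sample_frames(240, 8)
feats = np.random.default_rng(0).normal(size=(len(idx), 128))
score = predict_quality(feats)
```

Under the paper's findings, replacing the mean pooling with an RNN aggregator would not be expected to improve quality prediction, since the spatial features already carry most of the quality signal.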
Pages: 2693-2702
Page count: 10
References
78 records
[1] Abu-El-Haija S., 2016, YouTube-8M: A Large-Scale Video Classification Benchmark.
[2] [Anonymous], 2000, Final Report From the Video Quality Experts Group on the Validation of Objective Models of Video Quality Assessment.
[3] Bampis C. G., Li Z., Bovik A. C. Spatiotemporal Feature Integration and Model Fusion for Full Reference Video Quality Assessment. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 29(8): 2256-2270.
[4] Bhardwaj S., Srinivasan M., Khapra M. M. Efficient Video Classification Using Fewer Frames. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 354-363.
[5] Brandao T., Queluz M. P. No-Reference Quality Assessment of H.264/AVC Encoded Video. IEEE Transactions on Circuits and Systems for Video Technology, 2010, 20(11): 1437-1447.
[6] Carreira J., Zisserman A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017: 4724-4733.
[7] Chen P., Li L., Ma L., Wu J., Shi G. RIRNet: Recurrent-In-Recurrent Network for Video Quality Assessment. MM '20: Proceedings of the 28th ACM International Conference on Multimedia, 2020: 834-842.
[8] Cho K., 2014, arXiv:1406.1078.
[9] Csurka G., Larlus D., Perronnin F. What is a good evaluation measure for semantic segmentation? Proceedings of the British Machine Vision Conference 2013, 2013.
[10] Dalal N., Triggs B. Histograms of oriented gradients for human detection. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, Proceedings, 2005: 886-893.