Spatiotemporal Representation Learning for Blind Video Quality Assessment

Cited by: 31
Authors
Liu, Yongxu [1]
Wu, Jinjian [1]
Li, Leida [1]
Dong, Weisheng [1]
Zhang, Jinpeng [2]
Shi, Guangming [1]
Affiliations
[1] Xidian University, School of Artificial Intelligence, Xi'an 710071, People's Republic of China
[2] Second Academy of CASIC, X LAB, Beijing 100854, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; spatiotemporal phenomena; databases; quality assessment; video recording; data models; task analysis; blind video quality assessment; spatiotemporal representation; weakly supervised learning; similarity
DOI
10.1109/TCSVT.2021.3114509
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
Blind video quality assessment (BVQA) is of great importance for video-related applications, yet it remains challenging even in the deep learning era. The difficulty lies in the shortage of large-scale labeled data, which makes it hard to train a robust spatiotemporal encoder for BVQA. To relieve this difficulty, we first build a video dataset containing over 320K samples suffering from various compression and transmission artifacts. Because manually annotating the dataset with subjective perception is highly labor-intensive and time-consuming, we adopt reference-based VQA algorithms to weakly label the data automatically. We observe that a single weak label is derived from a single source of knowledge, which is deficient and incomplete for VQA. To alleviate the bias of a single weak label (i.e., single knowledge) in the weakly labeled dataset, we propose HEterogeneous Knowledge Ensemble (HEKE) for spatiotemporal representation learning. Compared to learning from a single source of knowledge, learning with HEKE is theoretically shown to achieve a lower infimum and to yield richer representations. On the basis of the built dataset and the HEKE methodology, a feature encoder specific to BVQA is formed, which directly extracts spatiotemporal representations from videos. The video quality can then be acquired either in a completely blind manner without ground truth, or via a finetuning-based regressor trained with labels. Extensive experiments on various VQA databases show that our BVQA model with the pretrained encoder achieves state-of-the-art performance. More surprisingly, even though it is trained on synthetic data, our model still shows competitive performance on authentic databases. The data and source code will be available at https://github.com/Sissuire/BVQA-HEKE.
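To make the HEKE idea in the abstract concrete, the following is a minimal PyTorch sketch of the training setup it describes: one spatiotemporal encoder regressed against several heterogeneous weak labels, each produced by a different full-reference VQA metric on the unlabeled videos. Everything here (the toy 3D-convolution encoder, the number of teacher metrics, the per-teacher MSE loss, and all names such as HEKEModel) is an illustrative assumption, not the authors' implementation; the official code is at the repository linked above.

    import torch
    import torch.nn as nn

    class HEKEModel(nn.Module):
        """Hypothetical sketch: shared encoder + one head per weak-label source."""

        def __init__(self, feat_dim=256, num_teachers=3):
            super().__init__()
            # Toy spatiotemporal encoder over (B, C, T, H, W) video clips.
            self.encoder = nn.Sequential(
                nn.Conv3d(3, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),
                nn.Flatten(),
                nn.Linear(32, feat_dim),
                nn.ReLU(),
            )
            # One regression head per source of "knowledge" (FR-VQA metric),
            # so no single weak label dominates the learned representation.
            self.heads = nn.ModuleList(
                [nn.Linear(feat_dim, 1) for _ in range(num_teachers)]
            )

        def forward(self, clips):
            feat = self.encoder(clips)                                 # (B, feat_dim)
            preds = torch.cat([head(feat) for head in self.heads], 1)  # (B, K)
            return feat, preds

    model = HEKEModel()
    clips = torch.randn(4, 3, 8, 64, 64)   # batch of 4 short clips
    weak_labels = torch.rand(4, 3)          # scores from 3 FR-VQA metrics (e.g., normalized)
    _, preds = model(clips)
    loss = nn.functional.mse_loss(preds, weak_labels)  # averaged per-teacher regression loss
    loss.backward()

In the paper's fully blind setting, quality would then be read from the pretrained representation without any ground-truth labels, while the finetuning variant corresponds to training a small regressor such as the heads above on subjective MOS data.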
Pages: 3500-3513
Page count: 14