Learning Spatiotemporal Interactions for User-Generated Video Quality Assessment

Cited by: 10
Authors
Zhu, Hanwei [1 ]
Chen, Baoliang [1 ]
Zhu, Lingyu [1 ]
Wang, Shiqi [1 ,2 ]
Affiliations
[1] City Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
[2] City Univ Hong Kong, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Distortion; Transformers; Feature extraction; Spatiotemporal phenomena; Video recording; Three-dimensional displays; Quality assessment; No-reference video quality assessment; user-generated content; vision transformer; SERIAL DEPENDENCE; FRAMEWORK;
DOI
10.1109/TCSVT.2022.3207148
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic and Communication Technology];
Discipline classification codes
0808; 0809;
Abstract
Distortions in the spatial and temporal domains have been identified as the dominant factors governing visual quality. Although both have been studied independently in deep learning-based user-generated content (UGC) video quality assessment (VQA), via frame-wise distortion estimation and temporal quality aggregation, much less work has been dedicated to integrating them with deep representations. In this paper, we propose a SpatioTemporal Interactive VQA (STI-VQA) model based on the philosophy that video distortion can be inferred from the integration of spatial characteristics and temporal motion along the flow of time. In particular, at each timestamp, the spatial distortion captured by feature statistics and the local motion captured by feature differences are extracted and fed to a transformer network for motion-aware interaction learning. Meanwhile, the information flow of spatial distortion from shallow to deep layers is constructed adaptively during temporal aggregation. The transformer network has a distinct advantage in modeling long-range dependencies, leading to superior performance on UGC videos. Experimental results on five UGC video benchmarks demonstrate the effectiveness and efficiency of our STI-VQA model, and the source code will be available online at https://github.com/h4nwei/STI-VQA.
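The pipeline described in the abstract, frame-level spatial statistics combined with feature-difference motion cues and aggregated over time by a transformer, can be illustrated with a minimal sketch. The PyTorch code below is only an assumption-laden illustration, not the authors' STI-VQA implementation (see the GitHub repository linked above): the ResNet-18 backbone, token dimension, pooling choices, and all class and parameter names are hypothetical.

```python
# Minimal sketch of the spatiotemporal-interaction idea from the abstract.
# NOT the official STI-VQA code; sizes and module names are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class SpatioTemporalQualitySketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Frame-level backbone: ResNet-18 with the pooling/classifier removed.
        backbone = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        feat_dim = 512
        # Spatial statistics (mean + std pooling) and frame-difference
        # "motion" features are projected to a common token dimension.
        self.proj = nn.Linear(3 * feat_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, video):
        # video: (B, T, C, H, W)
        b, t = video.shape[:2]
        feats = self.backbone(video.flatten(0, 1))            # (B*T, 512, h', w')
        mean = feats.mean(dim=(2, 3))                          # spatial mean pooling
        std = feats.std(dim=(2, 3))                            # spatial std pooling
        stats = torch.cat([mean, std], dim=1).view(b, t, -1)   # (B, T, 1024)
        # Local motion approximated by temporal differences of frame features.
        mean_seq = mean.view(b, t, -1)                          # (B, T, 512)
        motion = torch.zeros_like(mean_seq)
        motion[:, 1:] = mean_seq[:, 1:] - mean_seq[:, :-1]
        tokens = self.proj(torch.cat([stats, motion], dim=-1))  # (B, T, d_model)
        tokens = self.temporal(tokens)                           # long-range temporal interactions
        return self.head(tokens.mean(dim=1)).squeeze(-1)         # one quality score per video


if __name__ == "__main__":
    model = SpatioTemporalQualitySketch()
    clip = torch.randn(2, 8, 3, 224, 224)   # two 8-frame clips
    print(model(clip).shape)                # torch.Size([2])
```

In this toy version the motion cue is a simple difference of mean-pooled features; the paper additionally aggregates spatial distortion information from shallow to deep backbone layers, which is omitted here for brevity.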
Pages: 1031-1042
Page count: 12