Exploiting Spatial and Angular Correlations With Deep Efficient Transformers for Light Field Image Super-Resolution

Cited by: 75
Authors
Cong, Ruixuan [1 ,2 ,3 ]
Sheng, Hao [1 ,2 ,3 ]
Yang, Da [1 ,2 ,3 ]
Cui, Zhenglong [1 ,2 ,3 ]
Chen, Rongshan [1 ,2 ,3 ]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China
[2] Beihang Hangzhou Innovat Inst Yuhang, Hangzhou 310023, Peoples R China
[3] Macao Polytech Univ, Fac Appl Sci, Macau 999078, Peoples R China
Keywords
Transformers; Computational modeling; Superresolution; Spatial resolution; Feature extraction; Light fields; Convolution; Light field; transformer; super-resolution; sub-sampling spatial modeling; multi-scale angular modeling; NETWORK;
DOI
10.1109/TMM.2023.3282465
CLC classification
TP [Automation technology; computer technology];
Discipline code
0812 ;
Abstract
Global context information is particularly important for comprehensive scene understanding. It helps clarify local confusions and smooth predictions to achieve fine-grained and coherent results. However, most existing light field (LF) processing methods leverage convolution layers to model spatial and angular information, and the limited receptive field restricts their ability to learn long-range dependencies in the LF structure. In this article, we propose a novel network based on deep efficient transformers (i.e., LF-DET) for LF spatial super-resolution. It develops a spatial-angular separable transformer encoder with two modeling strategies, termed sub-sampling spatial modeling and multi-scale angular modeling, for global context interaction. Specifically, the former utilizes a sub-sampling convolution layer to alleviate the huge computational cost of capturing spatial information within each sub-aperture image. In this way, our model can cascade more transformers to continuously enhance feature representation with limited resources. The latter processes multi-scale macro-pixel regions to extract and aggregate angular features focusing on different disparity ranges, adapting well to disparity variations. Besides, we capture strong similarities among surrounding pixels by dynamic positional encodings to fill the gap left by transformers' lack of local information interaction. Experimental results on both real-world and synthetic LF datasets confirm that our LF-DET achieves a significant performance improvement over state-of-the-art methods. Furthermore, LF-DET shows high robustness to disparity variations through the proposed multi-scale angular modeling.
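The abstract's core efficiency argument is that self-attention cost grows quadratically with the number of spatial tokens, so sub-sampling each sub-aperture image before attention cuts cost sharply. A minimal NumPy sketch of that idea follows; the strided average pooling, function name, and stride value are illustrative assumptions, not the paper's actual sub-sampling convolution or implementation.

```python
import numpy as np

def subsampled_spatial_attention(feat, stride=2):
    """Hypothetical sketch of sub-sampling spatial modeling: pool each
    sub-aperture image by `stride` before self-attention, reducing the
    token count from H*W to (H/stride)*(W/stride) and the attention
    cost by a factor of stride**4."""
    H, W, C = feat.shape
    # Strided average pooling stands in for the paper's sub-sampling convolution.
    pooled = feat.reshape(H // stride, stride, W // stride, stride, C).mean(axis=(1, 3))
    tokens = pooled.reshape(-1, C)           # (H*W / stride**2, C) tokens
    scores = tokens @ tokens.T / np.sqrt(C)  # scaled dot-product attention scores
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = scores / scores.sum(axis=1, keepdims=True)  # softmax over keys
    return attn @ tokens                     # globally attended spatial tokens

feat = np.random.rand(32, 32, 8)
out = subsampled_spatial_attention(feat, stride=2)
print(out.shape)  # attention runs over 256 tokens instead of 1024
```

With stride 2, the 32x32 sub-aperture image yields 256 tokens rather than 1024, a 16x reduction in attention cost, which is what lets the paper cascade more transformer stages under a fixed budget.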
Pages: 1421-1435
Page count: 15