ETFormer: An Efficient Transformer Based on Multimodal Hybrid Fusion and Representation Learning for RGB-D-T Salient Object Detection
Cited by: 1
Authors:
Qiu, Jiyuan [1]; Jiang, Chen [1]; Wang, Haowen [1]
Affiliations:
[1] Tsinghua Univ, Sch Aerosp Engn, Beijing 100084, Peoples R China
Keywords:
Feature extraction; Training; Decoding; Computer architecture; Transformers; Representation learning; Object detection; Multimodal hybrid fusion; representation learning; RGB-D-T salient object detection; transformer; NETWORK
DOI: 10.1109/LSP.2024.3465351
CLC (Chinese Library Classification):
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline codes:
0808; 0809
Abstract:
Because depth and thermal images are both susceptible to environmental interference, researchers have begun to combine the three modalities for salient object detection (SOD). In this letter, we propose an efficient transformer network (ETFormer) based on multimodal hybrid fusion and representation learning for RGB-D-T SOD. First, unlike most prior work, we design a backbone that extracts features from all three modalities and propose a multi-modal multi-head attention module (MMAM) for feature fusion, which improves network performance while reducing computational redundancy. Second, we reassemble a three-modal dataset, R-D-T ImageNet-1K, to pretrain the network, addressing the problem that the depth and thermal branches would otherwise be pretrained on RGB data alone. Finally, extensive experiments show that the proposed method combines the advantages of the different modalities and achieves better performance than existing methods.
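The abstract does not specify the internals of the MMAM, so the following is only an illustrative sketch of one common way to fuse three modalities with multi-head attention: tokens from the RGB stream attend over the concatenated RGB, depth, and thermal tokens, followed by a residual connection. All names, shapes, and the residual design here are assumptions, not the paper's actual architecture.

```python
# Illustrative (hypothetical) multi-modal multi-head attention fusion.
# The real MMAM design in ETFormer is not described in the abstract;
# shapes and the fusion rule below are assumptions for demonstration.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_three_modalities(rgb, depth, thermal, num_heads=4):
    """RGB tokens (N, C) attend over the concatenation of all three
    modality token sets (3N, C); the attended output is added back
    to the RGB stream as a residual."""
    n, c = rgb.shape
    assert c % num_heads == 0, "channels must divide evenly across heads"
    d = c // num_heads
    kv = np.concatenate([rgb, depth, thermal], axis=0)        # (3N, C)
    q = rgb.reshape(n, num_heads, d).transpose(1, 0, 2)       # (H, N, d)
    k = kv.reshape(3 * n, num_heads, d).transpose(1, 0, 2)    # (H, 3N, d)
    v = k
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))     # (H, N, 3N)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, c)         # (N, C)
    return rgb + out                                          # residual fusion

rng = np.random.default_rng(0)
rgb = rng.standard_normal((16, 32))
dep = rng.standard_normal((16, 32))
thr = rng.standard_normal((16, 32))
fused = fuse_three_modalities(rgb, dep, thr)
print(fused.shape)  # (16, 32)
```

A single shared attention over the concatenated key/value set is cheaper than three pairwise cross-attentions, which is one plausible reading of the abstract's claim that the fusion module reduces computational redundancy.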
Pages: 2930-2934
Page count: 5