TVT-Transformer: A Tactile-visual-textual fusion network for object recognition

Cited by: 6
Authors
Li, Baojiang [1 ,2 ]
Li, Liang [1 ,2 ]
Wang, Haiyan [1 ,2 ]
Chen, Guochu [1 ,2 ]
Wang, Bin [1 ,2 ]
Qiu, Shengjie [1 ,2 ]
Affiliations
[1] Shanghai Dianji Univ, Sch Elect Engn, Shanghai, Peoples R China
[2] Shanghai Dianji Univ, Intelligent Decis & Control Technol Inst, Shanghai, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Embodied intelligence; Multi-modal fusion; Object recognition; TVT-Transformer; Attention mechanism; DEEP; SENSOR; SKIN;
DOI
10.1016/j.inffus.2025.102943
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In the pursuit of higher levels of intelligence, embodied agents need to integrate information from multiple perceptual channels through a multimodal information fusion mechanism in order to comprehensively understand the surrounding scene and the objects they manipulate. Most current multimodal perception research centers on vision-based fusion, typically involves only two modalities, and seldom explores the fusion of three or more. We therefore propose TVT-Transformer (TVT: Tactile-Visual-Textual), a new framework for joint learning from three modalities: tactile, visual, and semantic text. The approach uses an attention mechanism to deeply mine and align information from the different perceptual modalities, allowing the model to efficiently integrate features from tactile, visual, and semantic text data and to achieve deeper information fusion through cross-modal interaction. The framework introduces a novel semantic representation approach that generates standardized semantic descriptions by combining human observation of and touch interaction with objects. These descriptions are encoded with a pre-trained BERT model and aligned with the visual and tactile information. The Query (Q), Key (K), and Value (V) components from the tactile, visual, and textual modalities are then integrated into a unified Q, K, and V, and the attention mechanism of the Transformer architecture performs cross-attention over them, enabling more accurate and efficient cross-modal feature integration and understanding. The textual modality provides semantic support that enhances the effectiveness of information integration and improves the accuracy of object recognition. Taking tactile sensor data, visual image data, and the corresponding semantic text descriptions as input, the method is validated on both publicly available and self-made datasets, where it strengthens feature expression, achieves significant gains in multimodal data integration, and improves object recognition accuracy. Compared with the classical Transformer, TVT-Transformer effectively fuses visual, tactile, and semantic textual information through its cross-modal self-attention mechanism and exhibits greater adaptability and robustness when processing multimodal information. This study not only offers a new perspective on multimodal information fusion but also provides strong technical support for the development of embodied intelligence. The TVT-Transformer framework has a wide range of potential applications and is expected to play an important role in intelligent robotics, human-robot interaction, and assisted decision-making. The resource has been released at https://github.com/huakaichengbei/MSDO/tree/master.
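The abstract's core mechanism, projecting each modality to Q/K/V and merging them into a unified Q, K, and V before a single cross-attention pass, can be illustrated with a short PyTorch sketch. This is not the authors' released code: the module name TriModalCrossAttention, the feature dimension, the per-modality linear projections, and the residual-plus-LayerNorm wrapper are all assumptions made for illustration.

import torch
import torch.nn as nn

class TriModalCrossAttention(nn.Module):
    # Hypothetical sketch of the unified-Q/K/V cross-modal attention described
    # in the abstract; not the authors' implementation.
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # One Q/K/V projection per modality: tactile, visual, textual.
        self.q_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.k_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.v_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tactile, visual, text):
        # Each input is (batch, tokens_m, dim): encoder features for one modality,
        # e.g. tactile patches, visual patches, projected BERT token embeddings.
        feats = [tactile, visual, text]
        q = torch.cat([p(x) for p, x in zip(self.q_proj, feats)], dim=1)  # unified Q
        k = torch.cat([p(x) for p, x in zip(self.k_proj, feats)], dim=1)  # unified K
        v = torch.cat([p(x) for p, x in zip(self.v_proj, feats)], dim=1)  # unified V
        fused, _ = self.attn(q, k, v)   # joint cross-attention over all modalities
        return self.norm(fused + q)     # residual + norm; a classifier head would follow

# Toy usage with random tensors standing in for encoder outputs.
block = TriModalCrossAttention(dim=256, num_heads=4)
tac = torch.randn(2, 16, 256)   # tactile features
vis = torch.randn(2, 49, 256)   # visual patch features
txt = torch.randn(2, 12, 256)   # text token features (e.g. BERT output projected to 256-d)
out = block(tac, vis, txt)      # -> shape (2, 77, 256)

Concatenating the per-modality keys and values lets every query token attend across all three modalities in one pass, which is one plausible reading of the "unified Q, K, and V" description; the paper itself should be consulted for the exact fusion layout.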
Pages: 20