The human brain combines multisensory information, such as visual and tactile signals, in a statistically optimal manner to achieve perceptual interaction with the outside world. This article proposes a Siamese-based visual-tactile fusion model for clustering objects with multidimensional tactile attributes according to human subjective perception. Specifically, it introduces a similarity comparison structure that determines category labels via distance measurement, simulating the comparative decision-making mechanism of human perception. In the feature extraction and fusion stage, inspired by the multilevel processing of sensory signals in humans, statistical texture features and empirical features are extracted from the raw interactive information of the visual and tactile channels. These features are combined with the corresponding deep features encoded by a neural network, and visual and tactile information is fused through adaptive dynamic weighting. Experimental evaluation indicates that the proposed model predicts human subjective perception results well, and that visual-tactile fusion yields a significant perceptual enhancement over either the visual or tactile channel alone.
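As a rough illustration only (not the authors' implementation), the decision mechanism described above can be sketched in a few lines: fuse the two channels with softmax-normalized adaptive weights, then compare fused embeddings by Euclidean distance against a threshold to assign a same/different category label. The function names, the softmax form of the weighting, and the threshold value are all assumptions introduced here for clarity.

```python
import math

def fuse(visual_feat, tactile_feat, weight_scores):
    """Adaptively weight and sum visual and tactile feature vectors.

    weight_scores: two learned (here, given) channel scores; a softmax
    turns them into normalized fusion weights (an assumed weighting form).
    """
    exp_s = [math.exp(s) for s in weight_scores]
    total = sum(exp_s)
    w_v, w_t = exp_s[0] / total, exp_s[1] / total
    return [w_v * v + w_t * t for v, t in zip(visual_feat, tactile_feat)]

def siamese_distance(emb_a, emb_b):
    """Euclidean distance between two fused embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))

def same_category(distance, threshold=1.0):
    """Distance below the (hypothetical) threshold -> same perceptual cluster."""
    return distance < threshold
```

With equal channel scores the weights reduce to 0.5 each, so two objects with identical fused embeddings sit at distance zero and are assigned to the same cluster; in the actual model these weights would be produced dynamically per sample rather than fixed.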