Survey of Research on Deep Multimodal Representation Learning

Cited by: 1
Authors
Pan, Mengzhu [1]
Li, Qianmu [1]
Qiu, Tian [1]
Affiliations
[1] School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing
Keywords
deep learning; multimodal alignment; multimodal fusion; multimodal representation;
DOI
10.3778/j.issn.1002-8331.2206-0145
Abstract
Although deep learning has been widely adopted in many fields owing to its powerful nonlinear representation capability, the structural and semantic gaps between multi-source heterogeneous modalities seriously hinder the application of downstream deep learning models. To exploit the correlation and complementarity between modalities and to improve the predictive performance and generalization of deep learning, many researchers have proposed representation learning methods. However, research on multimodal representation learning is still in its infancy, and many open scientific problems remain: the field lacks a unified understanding, and its architectures and evaluation metrics are not yet fully established. Based on the feature structure, semantic information, and representation capacity of different modalities, this paper reviews the progress of deep multimodal representation learning from the perspectives of representation fusion and representation alignment, and systematically summarizes and classifies existing work. It further analyzes the basic structure, application scenarios, and key issues of representative frameworks and models, examines the theoretical foundations and latest developments of deep multimodal representation learning, and points out current challenges and future research directions, so as to further promote the development and application of deep multimodal representation learning. © 2024 Journal of Computer Engineering and Applications Beijing Co., Ltd.; Science Press. All rights reserved.
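To make the survey's two organizing perspectives concrete, the sketch below gives a minimal PyTorch-style illustration: representation fusion via concatenation of modality-specific embeddings, and representation alignment via a symmetric InfoNCE-style contrastive loss. This is an assumption of ours, not code from the paper; all names (FusionHead, alignment_loss, the embedding dimensions) are hypothetical.

```python
# Minimal sketch (not from the survey): late fusion by concatenation and
# contrastive alignment between two modality-specific embedding spaces.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Fuse two modality embeddings by concatenation + MLP (late fusion)."""
    def __init__(self, dim_img, dim_txt, dim_out):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim_img + dim_txt, dim_out),
            nn.ReLU(),
            nn.Linear(dim_out, dim_out),
        )

    def forward(self, z_img, z_txt):
        # Concatenate along the feature dimension, then project jointly.
        return self.mlp(torch.cat([z_img, z_txt], dim=-1))

def alignment_loss(z_img, z_txt, temperature=0.07):
    """Symmetric InfoNCE loss: matched image/text pairs in the batch are
    pulled together, mismatched pairs are pushed apart."""
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    logits = z_img @ z_txt.t() / temperature              # (B, B) similarities
    targets = torch.arange(z_img.size(0), device=z_img.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with dummy embeddings from two hypothetical encoders:
z_img, z_txt = torch.randn(8, 256), torch.randn(8, 128)
fused = FusionHead(256, 128, 256)(z_img, z_txt)           # joint representation
loss = alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```

In a typical pipeline the fused representation feeds a task head (classification, retrieval, and so on), while the alignment loss is used to pre-train the two encoders on paired data, as in CLIP-style training.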
Pages: 48-64 (16 pages)