Survey of Research on Deep Multimodal Representation Learning

Cited by: 1
Authors
Pan, Mengzhu [1]
Li, Qianmu [1]
Qiu, Tian [1]
Affiliations
[1] School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing
Keywords
deep learning; multimodal alignment; multimodal fusion; multimodal representation
DOI
10.3778/j.issn.1002-8331.2206-0145
Abstract
Although deep learning has been widely adopted across many fields owing to its powerful nonlinear representation capabilities, the structural and semantic gaps between multi-source heterogeneous modal data seriously hinder downstream deep learning models. Many scholars have proposed representation learning methods that exploit the correlation and complementarity between modalities to improve the predictive performance and generalization of deep models. However, research on multimodal representation learning is still in its infancy, and many open scientific problems remain: the field lacks a unified understanding, and its architectures and evaluation metrics are not yet fully established. Based on the feature structure, semantic information, and representational capacity of different modalities, this paper reviews the progress of deep multimodal representation learning from the perspectives of representation fusion and representation alignment, and systematically summarizes and classifies existing work. It analyzes the basic structure, application scenarios, and key issues of representative frameworks and models, examines the theoretical foundations and latest developments of deep multimodal representation learning, and points out the current challenges and future directions of the field, so as to further promote the development and application of deep multimodal representation learning. © 2024 Journal of Computer Engineering and Applications Beijing Co., Ltd.; Science Press. All rights reserved.
Pages: 48-64 (16 pages)
Related papers (89 in total)
[1]  
RASIWASIA N, COSTA PEREIRA J, COVIELLO E, Et al., A new approach to cross-modal multimedia retrieval[C], Proceedings of the 18th ACM International Conference on Multimedia, pp. 251-260, (2010)
[2]  
LECUN Y, BENGIO Y, HINTON G., Deep learning[J], Nature, 521, 7553, pp. 436-444, (2015)
[3]  
FROME A, CORRADO G S, SHLENS J, Et al., DeViSE: a deep visual-semantic embedding model[C], Proceedings of NIPS, (2013)
[4]  
ANDREW G, ARORA R, BILMES J, Et al., Deep canonical correlation analysis[C], Proceedings of the International Conference on Machine Learning, (2013)
[5]  
PENG Y, YUAN Y., Modality-specific cross-modal similarity measurement with recurrent attention network[J], IEEE Transactions on Image Processing, 27, 11, pp. 5585-5599, (2018)
[6]  
CORTES C, VAPNIK V., Support-vector networks[J], Machine Learning, 20, 3, pp. 273-297, (1995)
[7]  
MORADE S S, PATNAIK S., Comparison of classifiers for lip reading with CUAVE and TULIPS database[J], Optik, 126, 24, pp. 5753-5761, (2015)
[8]  
NGIAM J, KHOSLA A, KIM M, Et al., Multimodal deep learning, Proceedings of ICML, (2011)
[9]  
SRIVASTAVA N, SALAKHUTDINOV R., Multimodal learning with deep Boltzmann machines[J], Journal of Machine Learning Research, 15, 1, pp. 2949-2980, (2014)
[10]  
VASWANI A, SHAZEER N, PARMAR N, Et al., Attention is all you need[C], Advances in Neural Information Processing Systems, (2017)