This paper studies the multimodal process of machine translation, focusing on the development of artificial intelligence in language processing and the conversion of vocabulary into numerical representations through vectorization. Neural network models such as CBOW and Skip-gram are used to analyze word vectorization. The paper also examines the Transformer model and its self-attention mechanism, emphasizing the importance of Layer Normalization for training stability and convergence speed. The emergence of ChatGPT as a state-of-the-art conversational AI model highlights the role such systems can play in assisting translators with language understanding and generation tasks. The application of generative artificial intelligence in translation practice is then discussed, where human-machine interaction makes full use of human intelligence while drawing on AI capabilities. DALL·E 2 can generate images from text, and integrating images with translated text plays an important role in constructing the being of the intersemiotically translated work, because text-image multimodal interaction effectively preserves the work's existential emotions.
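To make the Layer Normalization step mentioned above concrete, the following is a minimal sketch in plain Python. It assumes the commonly used formulation, in which each feature vector is normalized by its own mean and variance and then rescaled by a learnable scale and shift. The names `layer_norm`, `gamma`, `beta`, and `eps` are illustrative choices and are not taken from the paper.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one feature vector, then apply a per-feature scale and shift.

    x     : list of floats (one token's hidden vector)
    gamma : per-feature scale (assumed to default to all ones)
    beta  : per-feature shift (assumed to default to all zeros)
    eps   : small constant that avoids division by zero
    """
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n

    mean = sum(x) / n
    # Biased variance over the feature dimension of this single vector.
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)

    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


# Illustrative input: a hypothetical 4-dimensional hidden vector.
hidden = [2.0, 4.0, 6.0, 8.0]
normalized = layer_norm(hidden)
```

With the default scale and shift, the output has approximately zero mean and unit variance regardless of the scale of the input. This is one common explanation for why the step helps keep activations in a stable range during training.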