Cross-lingual spoken language understanding (cross-lingual SLU), a key component of task-oriented dialogue systems, is widely used in industrial and real-world scenarios such as multilingual customer support systems, cross-border communication platforms, and international language learning tools. However, obtaining large-scale, high-quality SLU datasets is challenging due to the high cost of dialogue collection and manual annotation, particularly for low-resource languages. As a result, there is growing interest in leveraging high-resource language data for cross-lingual transfer learning. Existing approaches to zero-shot cross-lingual SLU primarily focus on the relationship between the source-language sentence and a single generated cross-lingual sentence, disregarding the information shared among multiple languages. This limitation weakens the robustness of multilingual word embedding representations and hampers the scalability of the model. In this paper, we propose a multilingual mixture attention interaction framework with adversarial training to alleviate these problems. Specifically, we generate multiple multilingual hybrid sentences from each source-language sentence, so that during encoding each word can adaptively capture unambiguous representations from its aligned multilingual counterparts, and we introduce adversarial training to enhance the scalability of the model. We then incorporate a symmetric kernel self-attention module with positional embeddings to capture contextual information within a sentence, and employ multi-relation graph convolutional networks to model information at different granularities between the two highly correlated tasks of intent detection and slot filling. Experimental results on the public MultiATIS++ dataset demonstrate that our proposed model achieves state-of-the-art performance, and comprehensive analysis validates the effectiveness of each component.
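To make the hybrid-sentence generation step concrete, the sketch below illustrates how multiple code-switched variants of a source-language sentence could be produced by substituting words with aligned translations. It is a minimal illustration only: the dictionary MULTILINGUAL_DICT, the function generate_hybrid_sentences, and the sampling parameters are hypothetical assumptions, not the implementation used in this paper.

```python
import random

# Hypothetical aligned multilingual dictionary (illustrative only):
# each source-language word maps to translations in several target languages.
MULTILINGUAL_DICT = {
    "flights": {"de": "Flüge", "es": "vuelos", "fr": "vols"},
    "morning": {"de": "Morgen", "es": "mañana", "fr": "matin"},
    "show":    {"de": "zeige",  "es": "muestra", "fr": "montre"},
}

def generate_hybrid_sentences(tokens, num_variants=3, switch_prob=0.5, seed=0):
    """Generate multiple multilingual hybrid (code-switched) variants of a
    source-language sentence by replacing words with aligned translations."""
    rng = random.Random(seed)
    variants = []
    for _ in range(num_variants):
        hybrid = []
        for tok in tokens:
            translations = MULTILINGUAL_DICT.get(tok.lower())
            if translations and rng.random() < switch_prob:
                lang = rng.choice(sorted(translations))  # pick a target language
                hybrid.append(translations[lang])        # substitute aligned word
            else:
                hybrid.append(tok)                       # keep the source word
        variants.append(hybrid)
    return variants

if __name__ == "__main__":
    sentence = "show me morning flights to boston".split()
    for variant in generate_hybrid_sentences(sentence):
        print(" ".join(variant))
```

In the proposed framework, such hybrid sentences would then be fed to the encoder so that each word can attend to its aligned multilingual counterparts; the encoder and attention components themselves are not shown here.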