Emotion recognition in conversations (ERC) plays a crucial role in human–computer interaction and affective computing. However, existing ERC methods face several challenges, including insufficient data annotations and the difficulty of effectively integrating multimodal information. To overcome these challenges, we propose SSMIM, a novel semi-supervised multimodal emotion recognition framework. SSMIM enhances emotional feature representation through a primary-modality-guided strategy that combines intra-modality representations with cross-modality interactions. It further employs a context modeling approach that uses a directed acyclic graph and a bidirectional gated recurrent unit to capture contextual dependencies in dialogues from both multimodal and primary-modality perspectives, thereby improving emotion classification accuracy. Moreover, to handle dynamic data and limited annotations in real-time scenarios, SSMIM integrates an online learning mechanism that leverages pseudo-label generation and self-training to mitigate the scarcity of labeled data and to adapt the model to real-time changes in dialogue context. Experimental results on IEMOCAP, MELD, and CMU-MOSEI show that SSMIM outperforms existing methods and achieves state-of-the-art performance.
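As a rough illustration of the pseudo-label self-training mechanism mentioned above, the sketch below shows a generic confidence-thresholded self-training step. The model, data batches, and threshold value are hypothetical assumptions introduced for illustration only, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch (not the authors' code): one self-training update that
# combines a supervised loss on labeled turns with a pseudo-label loss on
# confident predictions for unlabeled turns.

CONF_THRESHOLD = 0.9  # assumed confidence cutoff for keeping pseudo-labels

def self_training_step(model, optimizer, labeled_batch, unlabeled_batch):
    feats_l, labels = labeled_batch      # features and gold labels
    feats_u = unlabeled_batch            # features without annotations

    # Supervised loss on the small annotated set.
    logits_l = model(feats_l)
    loss = F.cross_entropy(logits_l, labels)

    # Generate pseudo-labels for unlabeled dialogue turns.
    with torch.no_grad():
        probs_u = F.softmax(model(feats_u), dim=-1)
        conf, pseudo = probs_u.max(dim=-1)
        mask = conf >= CONF_THRESHOLD

    # Add the unsupervised term only for high-confidence predictions.
    if mask.any():
        logits_u = model(feats_u[mask])
        loss = loss + F.cross_entropy(logits_u, pseudo[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```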