Virtual Reality (VR) is a technology that creates a simulated, immersive environment in which users can be more engaged and interactive. Users can interact with a VR environment through head-mounted displays, hand controllers and, in some cases, speech. VR has been widely used across industries, including education, to build simulated and interactive experiences. However, little prior research has explored the integration of speech input in VR educational environments. Moreover, there is currently a lack of understanding of how speech and verbal/textual interaction can support personalisation in VR in general, and in the educational domain in particular. This research aims to fill this gap. Our long-term goal is to incorporate speech and text-based interaction into VR learning environments, to support smooth and natural personalisation of the learning interaction. Personalisation is used here in the classical AI in Education sense of adapting the learning system to a learner, e.g. to their level or needs. As a first step, we have started exploring and comparing different speech recognition models that support VR applications. Next, we will personalise the user experience by applying NLP and adaptation techniques to the text generated from the speech input. Finally, we will investigate the impact of this kind of personalisation on learner engagement and outcomes.