Indoor multi-person tracking is a widely explored research area, yet publicly available datasets are either oversimplified or provide only visual data. To fill this gap, this paper presents RAV4D, a novel multimodal dataset comprising data from radar, microphone arrays, and stereo cameras, annotated with 3D positions, Euler angles, and Doppler velocities. By integrating these data types, RAV4D aims to exploit the complementary strengths of the three modalities to improve tracking performance. Building the dataset required addressing two main challenges: sensor calibration and 3D annotation. A novel calibration target is designed to jointly calibrate the radar, stereo camera, and microphone array. In addition, a visually guided annotation framework is proposed to address the difficulty of annotating radar data. This framework uses head positions, heading orientations, and depth information from the stereo cameras and radar to establish accurate ground-truth trajectories for multimodal tracking. The dataset is publicly available at https://zenodo.org/records/10208199.
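To make the annotation types concrete, the sketch below shows one way a per-frame, per-person label combining a 3D position, Euler angles, and a Doppler velocity might be represented in Python. This is purely illustrative; the field names, units, and structure are assumptions for exposition, not the dataset's actual schema, which is documented in the Zenodo record.

```python
from dataclasses import dataclass

@dataclass
class TrackAnnotation:
    """Hypothetical per-frame annotation for one tracked person."""
    frame: int                                 # frame index in the recording
    person_id: int                             # identity of the tracked person
    position: tuple[float, float, float]       # 3D position (x, y, z), e.g. in metres
    euler_angles: tuple[float, float, float]   # heading orientation (yaw, pitch, roll)
    doppler_velocity: float                    # radial velocity from radar, e.g. in m/s

# Example: one annotation on frame 0 for person 1
ann = TrackAnnotation(
    frame=0,
    person_id=1,
    position=(1.2, 0.4, 1.7),
    euler_angles=(30.0, 0.0, 0.0),
    doppler_velocity=-0.35,
)
print(ann)
```

A sequence of such records over frames would form one ground-truth trajectory of the kind the visually guided annotation framework produces.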