In multi-speaker scenarios, humans can selectively attend to a specific speaker through auditory attention to extract the desired information. Auditory assistance devices rely on auditory attention detection (AAD) to accomplish the same task. Current AAD algorithms face several challenges: the low signal-to-noise ratio of EEG signals, susceptibility to interference from ocular and muscular artifacts, complex associations between signals at different frequencies and auditory attention, and the potential impact of differences in brain health on frequency-domain images. In this paper, we propose a novel multi-scale, multi-plane 3D convolutional neural network. First, guided by spatial attention, features are extracted from EEG frequency-domain data across multiple plane orientations and scales to mitigate noise interference. Second, multi-channel grouped convolution decouples the features of each channel while capturing latent associations between different frequency features and auditory attention. Finally, a clustering loss function pulls classification scores toward the cluster centers of their categories, enhancing generalization while avoiding overfitting. Experimental results on two datasets demonstrate that our network outperforms competing networks across different decision window lengths, which is beneficial for the development of practical neural-guided hearing devices.
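The clustering loss described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the function names, the squared-Euclidean distance to per-class centers, and the momentum-based center update are all assumptions chosen to make the idea concrete.

```python
import numpy as np

def clustering_loss(scores, labels, centers):
    """Mean squared distance between each sample's classification score
    vector and the cluster center of its ground-truth class.

    scores:  (N, C) array of classification scores
    labels:  (N,)   array of integer class labels
    centers: (C, C) array, one learned center vector per class
    """
    diffs = scores - centers[labels]            # (N, C), center per sample
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

def update_centers(scores, labels, centers, momentum=0.9):
    """Illustrative exponential-moving-average update of each class center
    toward the mean score of that class in the current batch."""
    new = centers.copy()
    for c in range(centers.shape[0]):
        mask = labels == c
        if mask.any():
            new[c] = momentum * centers[c] + (1 - momentum) * scores[mask].mean(axis=0)
    return new
```

During training, this term would be added to the usual cross-entropy loss so that score vectors of the same attention class are drawn together, which is one common way to realize the "closer to the clustering centers" objective.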