One-dimensional (1D) airborne transient electromagnetic (ATEM) inversion is still the most popular method applied to field data, because conventional three-dimensional (3D) inversion requires forward calculations throughout the inversion process, which is time-consuming, and its results are strongly influenced by the initial model. Moreover, the conventional 3D inversion process is unstable and prone to converging to local optima. However, the true subsurface structure is 3D, so a 3D inversion is needed to resolve structural details. We present a novel deep-learning framework, ATEM3D-Net, designed for the 3D inversion of ATEM data. ATEM3D-Net uses an encoder-decoder architecture that integrates a 3D U-Net with ConvLSTM to perform an end-to-end mapping from electromagnetic response data to subsurface resistivity models; the ConvLSTM learns the spatiotemporal dependencies of ATEM data, which improves the inversion results. Furthermore, we optimize the network training strategy to help the network converge to the global optimum. We evaluate the performance of ATEM3D-Net using both forward-modeling data and synthetic data generated from a field model, demonstrating its superior ability to handle noise and its generalization across different geological settings.