Compared with 2-D electromagnetic tomography (2-D EMT), image reconstruction in 3-D electromagnetic tomography (3-D EMT) must solve for a significantly larger number of unknowns in the spatial distribution of electrical conductivity, and the equations become more severely ill-conditioned, so traditional imaging algorithms produce inaccurate reconstructions. To address this problem, this article proposes a deep learning image reconstruction algorithm for 3-D EMT, Transformer-V-Net (TV-Net). TV-Net consists of four sequentially connected modules: data preprocessing, initial imaging, feature extraction, and image reconstruction. In the initial imaging phase, a Transformer encoder is introduced to capture the relationships within the input sequences and extract rich feature information from them. In the feature extraction and image reconstruction phases, a 3-D fully convolutional network is employed to learn the 3-D spatial structure of the input data, thereby achieving more accurate feature extraction and representation. In this study, 13 851 image samples are designed for training and testing, each consisting of a voltage vector and its corresponding medium distribution vector. In addition, noisy test data and random samples not included in the dataset are used to evaluate the noise resistance and generalization ability of the network, respectively. The results show that, compared with traditional algorithms, TV-Net performs better in both visual quality and quantitative evaluation criteria, verifying its feasibility and effectiveness.
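
As an illustration of the noise-resistance test described above, the following minimal sketch shows one common way to corrupt a voltage vector at a prescribed signal-to-noise ratio (SNR). It is an assumption-laden example, not the procedure used in this article: the noise model (zero-mean additive white Gaussian noise), the 30 dB level, the vector length, and the helper names `add_gaussian_noise` and `measured_snr_db` are all hypothetical choices made for illustration.

```python
import math
import random


def add_gaussian_noise(voltages, snr_db, seed=None):
    """Return a noisy copy of `voltages` at approximately `snr_db` dB SNR.

    Assumption: zero-mean additive white Gaussian noise. The article does
    not specify the noise model or the SNR levels it uses.
    """
    rng = random.Random(seed)
    # Mean power of the clean measurement vector.
    signal_power = sum(v * v for v in voltages) / len(voltages)
    # SNR_dB = 10 * log10(P_signal / P_noise)
    #   =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in voltages]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean vector and its noisy version."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    # Hypothetical voltage vector; values and length are illustrative only.
    clean = [1.0 + 0.001 * k for k in range(2000)]
    noisy = add_gaussian_noise(clean, snr_db=30.0, seed=0)
    print(f"empirical SNR: {measured_snr_db(clean, noisy):.2f} dB")
```

Because the noise is random, the empirical SNR only approximates the target value. Fixing `seed` makes a noisy test set reproducible, so that different reconstruction algorithms can be compared on the same corrupted inputs.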