An X-ray free-electron laser (XFEL) facility can produce on the order of 1,000,000 extremely bright X-ray light pulses per second. Using an XFEL to image the atomic structure of a molecule requires fast analysis of an enormous amount of data, estimated to exceed one terabyte per second and to require petabytes of storage. The SpiniFEL application provides such analysis by determining the 3D structure of proteins from single-particle imaging (SPI) experiments performed using XFELs, but it needs significantly better performance and efficiency to scale and keep up with terabyte-per-second data production. This paper therefore addresses the high-performance computing optimizations and scaling needed to improve the 3D reconstruction of SPI data. First, we optimize data movement, memory efficiency, and algorithms to improve per-node computational efficiency, delivering a 5.28x speedup over the baseline GPU implementation. In addition, we achieve a 485x speedup for the post-analysis reconstruction resolution calculation, which previously took as long as the 3D reconstruction of the SPI data itself. Second, we present a novel distributed shared-memory computational algorithm that hides data latency and load-balances network traffic, enabling the processing of 128x more orientations than previously possible. Third, we conduct an exploratory study over the hyperparameter space of the SpiniFEL application to identify the optimal parameters for our target hardware, which yields a speedup of up to 1.25x from tuning the number of streams. Overall, we achieve a 6.6x speedup (i.e., 5.28 × 1.25) over the previous fastest GPU-MPI-based SpiniFEL realization.
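The headline figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of SpiniFEL: the helper names `combined_speedup` and `bytes_per_pulse` are hypothetical, and the per-pulse estimate assumes a decimal terabyte (10^12 bytes) and that the data rate is spread evenly across pulses, which the text does not state.

```python
def combined_speedup(*factors: float) -> float:
    """Multiply speedup factors that apply to the same workload in sequence."""
    result = 1.0
    for factor in factors:
        result *= factor
    return result


def bytes_per_pulse(data_rate_bytes_per_s: float, pulses_per_s: float) -> float:
    """Average data volume per pulse, assuming an even spread (an assumption)."""
    return data_rate_bytes_per_s / pulses_per_s


# Per-node optimizations (5.28x) combined with stream tuning (up to 1.25x).
overall = combined_speedup(5.28, 1.25)

# Roughly 1 TB/s over roughly 1,000,000 pulses/s; both are order-of-magnitude figures.
per_pulse = bytes_per_pulse(1e12, 1_000_000)

print(f"overall speedup: {overall:.2f}x")
print(f"data per pulse: {per_pulse / 1e6:.1f} MB")
```

Under these assumptions the two factors multiply to the reported 6.6x, and each pulse corresponds to roughly one megabyte of data. Note that the 1.25x factor is an upper bound ("up to"), so the combined figure should be read as a best case.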