In recent years, quantum computing has undergone significant developments and has established its supremacy in many application domains. While quantum hardware is accessible to the public through the cloud, a robust and efficient quantum circuit simulator is still necessary to investigate the constraints of quantum computing and to foster its development, including quantum algorithm development and quantum device architecture exploration. In this paper, we observe that most publicly available quantum circuit simulators (e.g., QISKit from IBM, QDK from Microsoft, and Qsim-Cirq from Google) suffer from slow simulation and poor scalability as the number of qubits increases. To this end, we systematically investigate the deficiencies in quantum circuit simulation (QCS) and propose Q-GPU, a framework that leverages GPUs with comprehensive optimizations to enable efficient and scalable QCS. Specifically, Q-GPU features i) proactive state amplitude transfer, ii) zero state amplitude pruning, iii) delayed qubit involvement, and iv) lossless non-zero state amplitude compression. Experimental results across nine representative quantum circuits indicate that Q-GPU reduces the execution time of the state-of-the-art GPU-based QCS by 71.89% (3.55x speedup). Q-GPU also outperforms the state-of-the-art OpenMP CPU implementation, the Google Qsim-Cirq simulator, and the Microsoft QDK simulator by 1.49x, 2.02x, and 10.82x, respectively.