Molecular dynamics is an important method for protein structure research, a field often called the holy grail of bioinformatics, and it is a popular application on large supercomputers. GROMACS is a widely used software package for molecular dynamics simulation. Almost all existing optimization efforts focus only on the most costly part of the computation (which consumes less than 70% of the run time), while the remaining parts still run serially. With the official optimized version, a MIC (Intel Many Integrated Core) card may not even match a highly optimized Xeon CPU. This paper presents a deep optimization of GROMACS on a single MIC card. We not only parallelize the hotspot but also make the sequential part as efficient as possible; in other words, we optimize the whole iteration loop. We explore the potential performance as far as possible by combining many optimization techniques, such as using SIMD instructions, rearranging the data layout, redesigning data structures, and adjusting the workflow. We achieve a speed of 500 steps/sec for a protein molecule without water consisting of 3657 atoms. The speedup ratio is more than twice what the official optimized version claims.
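
As an illustration of the data layout rearrangement listed among the techniques above, the following minimal C sketch converts atom coordinates from an array-of-structures layout to a structure-of-arrays layout, so that the inner loop reads each coordinate with unit stride and is easier for a compiler to vectorize with SIMD instructions. All names here (`AtomAoS`, `AtomsSoA`, `aos_to_soa`, `dist2_from`, `N_ATOMS`) are hypothetical and chosen for illustration; they are not GROMACS data structures, and this sketch is not the implementation described in the paper.

```c
#include <stddef.h>

#define N_ATOMS 8  /* hypothetical small system size, for illustration only */

/* Array-of-structures layout: x, y, z of one atom are adjacent in memory. */
typedef struct {
    float x, y, z;
} AtomAoS;

/* Structure-of-arrays layout: each coordinate is contiguous across atoms. */
typedef struct {
    float x[N_ATOMS];
    float y[N_ATOMS];
    float z[N_ATOMS];
} AtomsSoA;

/* Copy n atoms (n <= N_ATOMS) from the AoS layout into the SoA layout. */
void aos_to_soa(const AtomAoS *in, AtomsSoA *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* Squared distance from atom i to every atom j < n.
 * Each array is read with unit stride, which suits SIMD loads. */
void dist2_from(const AtomsSoA *a, size_t i, float *out, size_t n)
{
    const float xi = a->x[i];
    const float yi = a->y[i];
    const float zi = a->z[i];
    for (size_t j = 0; j < n; ++j) {
        const float dx = a->x[j] - xi;
        const float dy = a->y[j] - yi;
        const float dz = a->z[j] - zi;
        out[j] = dx * dx + dy * dy + dz * dz;
    }
}
```

In the array-of-structures form, a vector load of consecutive `x` values would have to skip over the interleaved `y` and `z` fields; the structure-of-arrays form avoids that at the cost of a one-time conversion.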