We present an alternative method for solving the two-view motion stereo problem in a variational framework. Instead of solving directly for depth, we simultaneously estimate the optical flow and the 3D structure by minimizing a joint energy function consisting of an optical flow constraint and a 3D constraint. Unlike conventional stereo methods, we impose the epipolar geometry as a soft constraint, which lets the correspondence search deviate from the epipolar lines rather than follow them strictly and makes the resulting correspondences more robust to small errors in pose estimation. This approach also allows us to use fast dense matching methods for handling large displacements, as well as a shape-based smoothness constraint on the 3D surface. Our results show that, in terms of accuracy, our method outperforms the state-of-the-art method in two-frame variational depth estimation and achieves results comparable to existing optical flow estimation methods. Our implementation achieves real-time performance on modern GPUs.