In this paper, we propose a new framework for compressive video sensing (CVS) that effectively exploits the inherent spatial and temporal redundancies of a video sequence. The proposed method splits the video sequence into key and non-key frames and divides each frame into small, non-overlapping blocks of equal size. At the decoder side, the key frames are reconstructed using an adaptively learned sparsifying (ALS) basis via ℓ0 minimization, in order to exploit the spatial redundancy; three well-known dictionary learning algorithms are investigated for this purpose. To recover a non-key frame, a prediction of the current frame is first formed from the previously reconstructed frame, in order to exploit the temporal redundancy. This prediction is then incorporated into a suitable optimization problem to recover the current non-key frame. To compare our experimental results with those of other methods, we employ the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) index as quality metrics. The numerical results demonstrate the effectiveness of the proposed method for CVS.
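As a minimal sketch of the block-based sensing and PSNR evaluation described above, the following illustrative Python code splits a frame into non-overlapping blocks, takes compressive measurements of each block, and computes PSNR. The Gaussian sensing matrix, block size, and subsampling rate are assumptions for illustration only, not the paper's actual configuration.

```python
import numpy as np

def block_measurements(frame, B, subrate, rng):
    """Split a frame into non-overlapping BxB blocks and take
    compressive measurements y = Phi @ x of each vectorized block.
    (Illustrative: a random Gaussian Phi is assumed here; the
    paper's sensing matrix may differ.)"""
    H, W = frame.shape
    m = max(1, int(round(subrate * B * B)))          # measurements per block
    Phi = rng.standard_normal((m, B * B)) / np.sqrt(m)
    ys = []
    for i in range(0, H - B + 1, B):
        for j in range(0, W - B + 1, B):
            x = frame[i:i + B, j:j + B].reshape(-1)  # vectorize the block
            ys.append(Phi @ x)
    return Phi, np.array(ys)

def psnr(ref, rec, peak=255.0):
    """Peak signal-to-noise ratio between a reference frame and
    its reconstruction, in dB."""
    mse = np.mean((ref.astype(float) - rec.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(32, 32)).astype(float)  # toy 32x32 frame
Phi, Y = block_measurements(frame, B=8, subrate=0.25, rng=rng)
print(Y.shape)              # one row of measurements per 8x8 block
print(psnr(frame, frame))   # identical frames give infinite PSNR
```

The per-block measurements `Y` would then be fed to the ℓ0-minimization decoder; SSIM, unlike PSNR, additionally compares local luminance, contrast, and structure statistics.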