A new recursive algorithm for fast computation of two-dimensional discrete cosine transforms (2-D DCT) is derived by converting the 2-D data matrices into 1-D vectors and then using different partition methods for the time and frequency indices. The algorithm first computes the 2-D complex DCT (2-D CCT) and then produces two 2-D DCT outputs simultaneously through a post-addition step. The decomposed form of the 2-D recursive algorithm looks very like a radix-4 FFT algorithm and is in particular suitable for VLSI implementation since the common entries in each row of the butterfly-like matrix are factored out in order to reduce the number of multipliers. A new linear systolic architecture is presented which leads to a hardware-efficient architectural design requiring only logN multipliers plus 3logN adders/subtractors for the computation of two N×N DCTs.