As process technology advances, chip yield is increasingly reduced by timing violations caused by variation in gate and wire delays during fabrication. Post-silicon delay tuning is a promising technique for improving yield: programmable delay elements (PDEs) are inserted into the clock tree before fabrication, and their delays are set after fabrication to recover from timing violations. In an existing method, each PDE consists of a buffer chain and a demultiplexer, and one PDE is inserted for each register, so power consumption and circuit area increase drastically compared with conventional clock-synchronous circuits. This paper proposes a PDE structure that reduces the circuit area. It also proposes a clustering method that merges several PDEs into a single PDE shared by multiple registers, which reduces both power consumption and circuit area. In computational experiments, the proposed method reduced power consumption and circuit area compared with the existing method.