We present a scheme for optimal VLSI layout and packaging of butterfly networks under the Thompson model, the multilayer grid model, and the hierarchical layout model. We show that when L layers of wires are available, an N-node butterfly network can be laid out with area 4N2/L2 log22 N+o(N2/L2 log2 N), maximum wire length 2N/L log2 N+o(N/L log N), volume 4N2/L log22 N+o(N2/L log2 N), under the multilayer 2-D grid model, where only one active layer (for network nodes) is required and L layers of wires are available. Our layout scheme allows us to partition an N-node butterfly network into Θ(N1/l log1-1/l N)-node clusters with an average of dic≈4l-4/log2 N( = O(1/log N) for any constant integer l) inter-cluster links per node, leading to optimal layout and packaging at the same time under the hierarchical layout model. The scalability of our layouts are optimal in that we can allow each of O(N/log N) nodes to occupy an area as large as o(N/L2 log N) and each of the remaining N-o(N) network nodes to occupy an area as large as O(N/L2 log2 N), without increasing the leading constants of layout area, volume, or maximum wire length.