A semantic segmentation neural network assigns a semantic label to each pixel of an input image and is widely used in domains such as remote sensing, autonomous driving, and image analysis. However, neural networks demand both highly parallel computation and high data throughput, which poses a major challenge for resource-constrained embedded devices. This article therefore designs a field-programmable gate array (FPGA)-based semantic segmentation neural network optimized for light weight, parallelism, and bandwidth, following a hardware-software co-design strategy. The proposed optimized ghost module halves the bandwidth requirement by reordering its internal structure. Group convolution, channel shuffle, and channel attention modules (CAMs) are used to further compress the network and improve accuracy. The segmentation network built on the optimized ghost module reduces computational complexity and parameter count by factors of 8.2 and 13.5, respectively. On the hardware side, an accelerator that is highly parallel in data input, processing, and output is designed. The CAM is divided into two parts: 1) a parameter-intensive section and 2) a computation-intensive section, which are computed in parallel by the processing system and programmable logic sides of the ZYNQ, respectively. In our experiments, the accelerator reached a clock frequency of 240 MHz and segmented a 320 × 320 input in 12.06 ms. It achieves a performance of 198.16 GOPS while the entire board consumes only 9.84 W at full load, making it suitable for embedded devices with real-time and low-power requirements.
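To put the reported figures in more familiar terms, the short sketch below derives a frame rate, an energy efficiency, and an energy per frame from the latency, throughput, and power stated above. It is a back-of-the-envelope calculation, not part of the reported method. The function name `derived_metrics` is hypothetical, and the sketch assumes that frames are processed one at a time with no pipelining or batching and that the board draws the stated full-load power for the whole inference.

```python
def derived_metrics(latency_ms: float, throughput_gops: float, power_w: float) -> dict:
    """Derive frame rate, energy efficiency, and energy per frame.

    Assumptions (not stated in the source): one frame is processed at a
    time, and the board draws `power_w` for the full duration of a frame.
    """
    latency_s = latency_ms / 1000.0
    return {
        # Frames per second if inferences run back to back.
        "frames_per_second": 1.0 / latency_s,
        # Throughput per watt of whole-board power.
        "gops_per_watt": throughput_gops / power_w,
        # W * s = J, converted to millijoules.
        "energy_per_frame_mj": power_w * latency_s * 1000.0,
    }


# Figures taken from the text: 12.06 ms per 320 x 320 input,
# 198.16 GOPS, and 9.84 W for the entire board at full load.
metrics = derived_metrics(latency_ms=12.06, throughput_gops=198.16, power_w=9.84)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the figures correspond to roughly 83 frames per second and about 20 GOPS/W. Because the power figure covers the whole board, the efficiency is a board-level value rather than one for the accelerator alone.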