The demand for faster computation in modern digital signal processing is high, but the speed that a single processor can provide is limited. To meet this demand, distributed systems and parallel processing are becoming a requirement in embedded systems, which makes research on parallel processing algorithms and applications important. Implementing the MPI standard on an embedded system increases application portability and therefore makes parallel programs easier to implement. This paper presents a novel design and implementation of MPI on top of our microkernel, named FLoW, built and run on an embedded system. To decrease communication latency, we propose a communication layer design based on MPI. In this layer, a process manager handles multiple processes and the routing service mechanism. In addition, a mailbox system temporarily stores messages that are sent while a collective operation is in progress. In our experiments, the time required to complete a data transmission ranges from 400 to 500 microseconds per process, and in parallel task tests using MPI, the speedup reaches 40 to 50%.
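To make the mailbox idea concrete, the following is a minimal sketch of a fixed-size mailbox that holds messages until a receiver asks for them by source rank and tag. It is an illustration only and not the FLoW implementation: the names `mailbox`, `mbox_init`, `mbox_put`, and `mbox_take`, as well as the slot count and payload size, are assumptions introduced here.

```c
#include <stddef.h>
#include <string.h>

#define MBOX_SLOTS   8   /* assumed mailbox capacity (hypothetical) */
#define MBOX_PAYLOAD 64  /* assumed maximum message size in bytes (hypothetical) */

/* One buffered message, identified by sender rank and tag. */
typedef struct {
    int used;                          /* 1 if the slot holds a message */
    int src;                           /* rank of the sending process */
    int tag;                           /* message tag */
    size_t len;                        /* payload length in bytes */
    unsigned char data[MBOX_PAYLOAD];  /* copied payload */
} mbox_slot;

typedef struct {
    mbox_slot slots[MBOX_SLOTS];
} mailbox;

/* Mark every slot as free. */
void mbox_init(mailbox *mb) {
    memset(mb, 0, sizeof *mb);
}

/* Copy a message into the first free slot.
 * Returns 0 on success, -1 if the mailbox is full or the message is too large. */
int mbox_put(mailbox *mb, int src, int tag, const void *buf, size_t len) {
    if (len > MBOX_PAYLOAD) return -1;
    for (int i = 0; i < MBOX_SLOTS; i++) {
        mbox_slot *s = &mb->slots[i];
        if (!s->used) {
            s->src = src;
            s->tag = tag;
            s->len = len;
            memcpy(s->data, buf, len);
            s->used = 1;
            return 0;
        }
    }
    return -1;
}

/* Remove the first stored message matching (src, tag) and copy it to buf.
 * Returns the number of bytes copied, or -1 if nothing matches or buf is too small.
 * Note: this simple scan does not preserve arrival order among messages
 * that share the same (src, tag). */
long mbox_take(mailbox *mb, int src, int tag, void *buf, size_t cap) {
    for (int i = 0; i < MBOX_SLOTS; i++) {
        mbox_slot *s = &mb->slots[i];
        if (s->used && s->src == src && s->tag == tag) {
            if (s->len > cap) return -1;
            memcpy(buf, s->data, s->len);
            s->used = 0;
            return (long)s->len;
        }
    }
    return -1;
}
```

In such a scheme, a message that arrives before the receiving process has reached the matching step of a collective operation would be stored with `mbox_put` and later retrieved with `mbox_take`. A real embedded implementation would also need synchronization between the communication layer and the receiving process, which this sketch omits.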