Speech recognition is a machine's ability to recognize words from speech. Speech recognition technology turns speech into a practical form of human-machine interaction, allowing humans to control smart devices more easily. However, speech variability, such as dialect or accent, vocabulary size, type of recognition, speaking rate, environmental noise, and microphone type, affects the speech recognition rate. In recent years, deep learning approaches such as CNN and BLSTM have been widely used and have delivered significant recognition improvements. Inspired by the strength of CNNs in exploiting local interspectral correlations and capturing frequency variations in speech signals, and of BLSTMs in learning temporal context, this study uses a hybrid CNN-BLSTM model for speech recognition with CTC as the decoder. This study uses continuous speech data in Indonesian with five different dialects, namely Balinese, Bataknese, Javanese, Minangese, and Sundanese. Four test scenarios are carried out sequentially to improve speech recognition performance, covering layer structure, use of dropout, number of filters and units, and type of input features. The first three scenarios use only 13 MFCC coefficients without delta features as the input. The results show that the combination of 2 CNN layers with 64 filters and 2 BLSTM layers with 128 units, with a dropout rate of 0.2 applied to all hidden layers, achieves a WER of 37.31%. The addition of delta and double-delta features further reduces the recognition error, achieving a WER of 10.80%.
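The input-feature step can be illustrated with a minimal sketch of how delta and double-delta coefficients are typically derived from a 13-coefficient MFCC matrix and stacked into a 39-dimensional feature vector per frame. This is not the authors' implementation; the function name and the half-window size N=2 are assumptions based on the standard regression-based delta formula.

```python
import numpy as np

def delta(feat, N=2):
    """Regression-based delta features over a half-window of N frames.
    feat: (frames, coeffs) feature matrix. N=2 is an assumed default."""
    T = feat.shape[0]
    denom = 2 * sum(n * n for n in range(1, N + 1))  # normalizer: 10 for N=2
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')  # repeat edge frames
    out = np.zeros_like(feat, dtype=float)
    for t in range(T):
        # weighted slope of coefficients across the +/-N frame window
        out[t] = sum(
            n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)
        ) / denom
    return out

# Example: 100 frames of 13 MFCC coefficients (random stand-in data)
mfcc = np.random.randn(100, 13)
d = delta(mfcc)        # delta (first-order dynamics)
dd = delta(d)          # double delta (second-order dynamics)
features = np.concatenate([mfcc, d, dd], axis=1)  # shape (100, 39)
```

Stacking the static, delta, and double-delta coefficients triples the per-frame dimensionality from 13 to 39, which is the feature change behind the WER drop from 37.31% to 10.80% reported above.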