A novel pre-processing technique of amplitude interpolation for enhancing the classification accuracy of Bengali phonemes

Cited by: 1

Authors:

Paul, Bachchu [1]

Phadikar, Santanu [2]

Affiliations:

[1] Vidyasagar Univ, Dept Comp Sci, Midnapore 721102, W Bengal, India

[2] Maulana Abul Kalam Azad Univ Technol, Dept Comp Sci & Engn, BF-142,Sect 1, Kolkata 700064, W Bengal, India

Source:

MULTIMEDIA TOOLS AND APPLICATIONS | 2023 / Vol. 82 / Issue 05

Keywords:

Phoneme; Diphthong; Lagrange interpolation; Mel frequency cepstral coefficient; Support vector machine; Deep neural network; SPEECH; RECOGNITION; CORPUS;

DOI:

10.1007/s11042-022-13594-5

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

In linguistics, phonemes are the atomic sounds, also called word segmentors, and they play an important role in recognizing words properly. This paper proposes a novel approach to recognizing seven Bengali vowels and ten diphthongs (a diphthong is a syllable formed by pronouncing two consecutive vowels). Before feature extraction, a novel pre-processing technique based on amplitude interpolation aligns the starting points of all phonemes of the same class, which in turn boosts the recognition rate. Audio clips of the seven Bengali vowels and ten diphthongs, each uttered ten times by twenty speakers of different age groups and sexes, were recorded to create a data set of 3400 audio samples. For each class of vowel and diphthong, one sample selected by a linguist serves as a benchmark. Each recorded clip is then interpolated to match the benchmark clip of the corresponding phoneme by finding the valleys in the amplitude using the Lagrange interpolation technique. Next, 19 MFCC (Mel Frequency Cepstral Coefficient) speech features are extracted from each phoneme of the interpolated audio clips and fed to Support Vector Machine (SVM), k-Nearest Neighbour (KNN) and Deep Neural Network (DNN) classifiers; the average classification accuracies obtained for vowels and diphthongs are 94.93% and 94.56% respectively. To check the effectiveness of the proposed pre-processing technique, the same MFCC features were extracted from the raw recorded phonemes and fed to the same classifiers; the average accuracies obtained for vowels and diphthongs are 89.21% and 88.56% respectively, which shows the effectiveness of the proposed method. The best accuracy was obtained with the DNN classifier: 98.16% for vowels and 97% for diphthongs.
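As a rough illustration of the Lagrange interpolation named in the abstract, the sketch below evaluates the Lagrange polynomial through a handful of amplitude samples at an intermediate position. The abstract does not say how valleys are detected or how a clip is matched to its benchmark, so this covers only the interpolation formula itself; the function name `lagrange_interpolate` and the sample values are assumptions made for illustration, not taken from the paper.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange polynomial through the points (xs[i], ys[i]) at x."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis polynomial L_i(x): equals 1 at xi and 0 at every other node.
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


# Hypothetical amplitude values at four sample positions (illustrative only).
sample_positions = [0.0, 1.0, 2.0, 3.0]
amplitudes = [0.0, 1.0, 4.0, 9.0]

# Estimate the amplitude halfway between the second and third samples.
estimate = lagrange_interpolate(sample_positions, amplitudes, 1.5)
```

The cost grows quadratically with the number of nodes, and high-degree Lagrange polynomials can oscillate between nodes, so a sketch like this is only reasonable over a small local window of samples.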

Citation

Pages: 7735-7755

Number of pages: 21