Hierarchical Phoneme Classification for Improved Speech Recognition

Cited by: 10

Authors:

Oh, Donghoon [1,2]

Park, Jeong-Sik [3]

Kim, Ji-Hwan [4]

Jang, Gil-Jin [2,5]

Affiliations:

[1] SK Holdings C&C, Gyeonggi Do 13558, South Korea

[2] Kyungpook Natl Univ, Sch Elect & Elect Engn, Daegu 41566, South Korea

[3] Hankuk Univ Foreign Studies, Dept English Linguist & Language Technol, Seoul 02450, South Korea

[4] Sogang Univ, Dept Comp Sci & Engn, Seoul 04107, South Korea

[5] Kyungpook Natl Univ, Sch Elect Engn, Daegu 41566, South Korea

Source:

APPLIED SCIENCES-BASEL | 2021 / Vol. 11 / No. 01

Funding:

National Research Foundation, Singapore;

Keywords:

speech recognition; phoneme classification; clustering; recurrent neural networks; NEURAL-NETWORKS; CONSONANTS;

DOI:

10.3390/app11010428

Chinese Library Classification (CLC) number:

O6 [Chemistry];

Subject classification code:

0703;

Abstract:

Featured Application: Automatic speech recognition; chatbot; voice-assisted control; multimodal man-machine interaction systems.

Speech recognition consists of converting input sound into a sequence of phonemes and then finding the text for the input using language models. Phoneme classification performance is therefore a critical factor in the successful implementation of a speech recognition system. However, correctly distinguishing phonemes with similar characteristics remains a challenging problem even for state-of-the-art classification methods, and the resulting classification errors are hard to recover from in the subsequent language processing steps. This paper proposes a hierarchical phoneme clustering method that applies more suitable recognition models to different phonemes. The phonemes of the TIMIT database are carefully analyzed using a confusion matrix from a baseline speech recognition model. Using the automatic phoneme clustering results, a set of phoneme classification models optimized for the generated phoneme groups is constructed and integrated into a hierarchical phoneme classification method. In a number of phoneme classification experiments, the proposed hierarchical phoneme group models improved performance over the baseline by 3%, 2.1%, 6.0%, and 2.2% for fricative, affricate, stop, and nasal sounds, respectively. The average accuracy was 69.5% for the baseline and 71.7% for the proposed hierarchical models, a 2.2% overall improvement.
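The abstract describes grouping phonemes by analyzing a confusion matrix from a baseline model. The following is only a minimal sketch of that general idea, not the authors' algorithm: it assumes average-linkage agglomerative merging on a symmetrized confusion rate, and the function name `cluster_phonemes`, the four labels, and the counts are all hypothetical toy values rather than TIMIT results.

```python
def cluster_phonemes(labels, confusion, n_groups):
    """Greedily merge phonemes that a baseline model confuses most often.

    Assumption: confusion[i][j] counts frames of true phoneme i that were
    predicted as phoneme j. This is an illustrative sketch only.
    """
    n = len(labels)
    # Row-normalize counts so each row is a distribution over predictions.
    rates = []
    for row in confusion:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    # Symmetric similarity: how often i and j are mistaken for each other.
    sim = [[(rates[i][j] + rates[j][i]) / 2 for j in range(n)] for i in range(n)]
    groups = [[i] for i in range(n)]

    def linkage(a, b):
        # Average similarity over all cross-group phoneme pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(groups) > n_groups:
        # Pick the pair of groups with the highest average mutual confusion.
        x, y = max(
            ((p, q) for p in range(len(groups)) for q in range(p + 1, len(groups))),
            key=lambda pq: linkage(groups[pq[0]], groups[pq[1]]),
        )
        groups[x] = groups[x] + groups[y]
        del groups[y]
    return [sorted(labels[i] for i in g) for g in groups]


# Hypothetical toy data: two fricatives and two nasals.
labels = ["s", "z", "m", "n"]
confusion = [
    [80, 15, 3, 2],   # true "s"
    [20, 75, 2, 3],   # true "z"
    [1, 2, 70, 27],   # true "m"
    [2, 1, 30, 67],   # true "n"
]
print(cluster_phonemes(labels, confusion, 2))
```

In a hierarchical setup of the kind the abstract outlines, each resulting group could then be handed to a classifier specialized for that group; how the paper actually trains and combines those models is not specified in this record.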


Pages: 1 / 17

Number of pages: 17
