Hierarchical Phoneme Classification for Improved Speech Recognition

Cited by: 10
Authors
Oh, Donghoon [1 ,2 ]
Park, Jeong-Sik [3 ]
Kim, Ji-Hwan [4 ]
Jang, Gil-Jin [2 ,5 ]
Affiliations
[1] SK Holdings C&C, Gyeonggi Do 13558, South Korea
[2] Kyungpook Natl Univ, Sch Elect & Elect Engn, Daegu 41566, South Korea
[3] Hankuk Univ Foreign Studies, Dept English Linguist & Language Technol, Seoul 02450, South Korea
[4] Sogang Univ, Dept Comp Sci & Engn, Seoul 04107, South Korea
[5] Kyungpook Natl Univ, Sch Elect Engn, Daegu 41566, South Korea
Source
APPLIED SCIENCES-BASEL | 2021 / Vol. 11 / Issue 01
Funding
National Research Foundation, Singapore;
Keywords
speech recognition; phoneme classification; clustering; recurrent neural networks; NEURAL-NETWORKS; CONSONANTS;
DOI
10.3390/app11010428
Chinese Library Classification (CLC) number
O6 [Chemistry];
Discipline classification code
0703 ;
Abstract
Featured Application: Automatic speech recognition; chatbot; voice-assisted control; multimodal man-machine interaction systems.

Speech recognition consists of converting input sound into a sequence of phonemes, then finding text for the input using language models. Phoneme classification performance is therefore a critical factor in the successful implementation of a speech recognition system. However, correctly distinguishing phonemes with similar characteristics remains a challenging problem even for state-of-the-art classification methods, and classification errors are hard to recover from in the subsequent language processing steps. This paper proposes a hierarchical phoneme clustering method that applies more suitable recognition models to different phonemes. The phonemes of the TIMIT database are analyzed using a confusion matrix from a baseline speech recognition model. Using the automatic phoneme clustering results, a set of phoneme classification models optimized for the generated phoneme groups is constructed and integrated into a hierarchical phoneme classification method. In phoneme classification experiments, the proposed hierarchical phoneme group models improved performance over the baseline by 3%, 2.1%, 6.0%, and 2.2% for fricative, affricate, stop, and nasal sounds, respectively. The average accuracy was 69.5% for the baseline and 71.7% for the proposed hierarchical models, a 2.2% overall improvement.
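The two ideas in the abstract, grouping phonemes by how often a baseline model confuses them and then routing each input to a group-specific model, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say which clustering algorithm or routing rule the authors use, so greedy average-linkage merging stands in for the clustering step, the confusion counts are invented toy numbers (not TIMIT results), and the names `cluster_by_confusion` and `hierarchical_predict` are hypothetical.

```python
from itertools import combinations


def cluster_by_confusion(labels, confusion, n_groups):
    """Greedy average-linkage clustering of phonemes from a confusion matrix.

    Illustrative stand-in only; not necessarily the paper's algorithm.
    confusion[i][j] = count of true phoneme i predicted as phoneme j.
    """
    # Row-normalise counts so each row estimates P(predicted | true).
    rates = []
    for row in confusion:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])

    def sim(i, j):
        # Symmetric similarity: mean of the two directed confusion rates.
        return 0.5 * (rates[i][j] + rates[j][i])

    clusters = [[i] for i in range(len(labels))]
    while len(clusters) > n_groups:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            score = sum(sim(i, j) for i, j in pairs) / len(pairs)
            if best is None or score > best[0]:
                best = (score, a, b)
        _, a, b = best
        # Merge the most mutually confused pair of clusters (a < b).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(labels[i] for i in c) for c in clusters]


def hierarchical_predict(frame, baseline, groups, group_models):
    """Route a frame to the model of the group its baseline guess falls in."""
    first_guess = baseline(frame)
    for idx, members in enumerate(groups):
        if first_guess in members:
            return group_models[idx](frame)
    return first_guess  # no group matched: keep the baseline decision


# Toy confusion counts (hypothetical, each row sums to 100).
labels = ["s", "z", "m", "n", "p", "b"]
confusion = [
    [80, 15, 1, 1, 2, 1],
    [18, 75, 1, 2, 2, 2],
    [1, 1, 70, 25, 2, 1],
    [1, 2, 22, 72, 1, 2],
    [2, 1, 1, 1, 78, 17],
    [1, 2, 2, 1, 20, 74],
]
groups = cluster_by_confusion(labels, confusion, 3)
print(groups)  # [['s', 'z'], ['m', 'n'], ['b', 'p']]

# Stub models (hypothetical): a baseline that always says "s", and one
# placeholder refinement model per group.
baseline = lambda frame: "s"
group_models = [lambda f: "z", lambda f: "n", lambda f: "b"]
print(hierarchical_predict(None, baseline, groups, group_models))  # z
```

In this toy run the fricatives, nasals, and stops fall into separate groups because their mutual confusion rates dominate. A real system would replace the stub callables with trained classifiers and the toy matrix with one measured on held-out data.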
Pages: 1 / 17
Page count: 17