Hierarchical Phoneme Classification for Improved Speech Recognition

Cited by: 10
Authors
Oh, Donghoon [1 ,2 ]
Park, Jeong-Sik [3 ]
Kim, Ji-Hwan [4 ]
Jang, Gil-Jin [2 ,5 ]
Affiliations
[1] SK Holdings C&C, Gyeonggi Do 13558, South Korea
[2] Kyungpook Natl Univ, Sch Elect & Elect Engn, Daegu 41566, South Korea
[3] Hankuk Univ Foreign Studies, Dept English Linguist & Language Technol, Seoul 02450, South Korea
[4] Sogang Univ, Dept Comp Sci & Engn, Seoul 04107, South Korea
[5] Kyungpook Natl Univ, Sch Elect Engn, Daegu 41566, South Korea
Source
APPLIED SCIENCES-BASEL | 2021, Vol. 11, No. 01
Funding
National Research Foundation of Singapore;
Keywords
speech recognition; phoneme classification; clustering; recurrent neural networks; NEURAL-NETWORKS; CONSONANTS;
DOI
10.3390/app11010428
CLC Number
O6 [Chemistry];
Subject Classification Code
0703;
Abstract
Featured Application: Automatic speech recognition; chatbots; voice-assisted control; multimodal man-machine interaction systems.
Speech recognition converts input sound into a sequence of phonemes and then finds the corresponding text using language models. Phoneme classification performance is therefore a critical factor in the successful implementation of a speech recognition system. However, correctly distinguishing phonemes with similar characteristics remains a challenging problem even for state-of-the-art classification methods, and classification errors are difficult to recover from in the subsequent language processing steps. This paper proposes a hierarchical phoneme clustering method that applies recognition models better suited to the different phonemes. The phonemes of the TIMIT database are carefully analyzed using a confusion matrix obtained from a baseline speech recognition model. Based on the automatic phoneme clustering results, a set of phoneme classification models optimized for the generated phoneme groups is constructed and integrated into a hierarchical phoneme classification method. In a number of phoneme classification experiments, the proposed hierarchical phoneme group models improved performance over the baseline by 3.0%, 2.1%, 6.0%, and 2.2% for fricative, affricate, stop, and nasal sounds, respectively. The average accuracy was 69.5% for the baseline and 71.7% for the proposed hierarchical models, a 2.2% overall improvement.
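The pipeline summarized above (baseline model, confusion matrix, automatic phoneme clustering, group-specific classifiers) can be illustrated with a minimal sketch of the grouping step alone. In the Python snippet below, the cluster_phonemes helper, the symmetric confusability-to-distance transform, the average-linkage clustering, and the 0.9 threshold are illustrative assumptions, not the exact procedure or settings reported in the paper.

# A minimal sketch of confusion-driven phoneme grouping, assuming the confusion
# matrix of a baseline classifier is available as a NumPy array of counts.
# Distance transform, linkage method, and threshold are illustrative choices.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_phonemes(confusion, phonemes, threshold=0.9):
    """Group phonemes that the baseline model frequently confuses with each other."""
    # Row-normalize counts to P(predicted | true).
    prob = confusion / confusion.sum(axis=1, keepdims=True)
    # Symmetric confusability and its complement as a distance.
    sim = 0.5 * (prob + prob.T)
    np.fill_diagonal(sim, 1.0)
    dist = 1.0 - sim
    # Condensed upper-triangular distances for SciPy's agglomerative clustering.
    condensed = dist[np.triu_indices(len(phonemes), k=1)]
    tree = linkage(condensed, method="average")
    labels = fcluster(tree, t=threshold, criterion="distance")
    groups = {}
    for phone, label in zip(phonemes, labels):
        groups.setdefault(label, []).append(phone)
    return list(groups.values())

# Toy example with made-up counts: /s/-/z/ and /p/-/b/ are mutually confusable.
phones = ["s", "z", "p", "b"]
conf = np.array([[80, 15,  3,  2],
                 [12, 82,  2,  4],
                 [ 2,  1, 70, 27],
                 [ 1,  3, 25, 71]])
print(cluster_phonemes(conf, phones))  # two groups: ['s', 'z'] and ['p', 'b']

In the hierarchical scheme described in the abstract, each resulting group would then receive its own optimized classification model, with a first-stage classifier routing inputs to the appropriate group model; that second stage is omitted from this sketch.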
Pages: 1-17 (17 pages)
Related Papers (50 in total)
  • [41] Conversion from Phoneme Based to Grapheme Based Acoustic Models for Speech Recognition
    Zgank, Andrej
    Kacic, Zdravko
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006 : 1587 - 1590
  • [42] Rule-Based Embedded HMMs Phoneme Classification to Improve Qur'anic Recitation Recognition
    Alqadasi, Ammar Mohammed Ali
    Sunar, Mohd Shahrizal
    Turaev, Sherzod
    Abdulghafor, Rawad
    Salam, Md Sah Hj
    Alashbi, Abdulaziz Ali Saleh
    Salem, Ali Ahmed
    Ali, Mohammed A. H.
    ELECTRONICS, 2023, 12 (01)
  • [43] Using Vector of Fractal Dimensions for Feature Reduction and Phoneme Recognition and Classification
    Hosseini, S. Abolfazl
    Ghassemian, Hassan
    Alizadeh, Roya
    2012 20TH TELECOMMUNICATIONS FORUM (TELFOR), 2012 : 748 - 751
  • [44] Effect of aging on speech features and phoneme recognition: A study on Bengali voicing vowels
    Das, B.
    Mandal, S.
    Mitra, P.
    Basu, A.
    KLUWER ACADEMIC PUBLISHERS, 16 : 19 - 31
  • [45] EVIDENCE FOR THE STRENGTH OF THE RELATIONSHIP BETWEEN AUTOMATIC SPEECH RECOGNITION AND PHONEME ALIGNMENT PERFORMANCE
    Baghai-Ravary, Ladan
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010 : 5262 - 5265
  • [46] LANGUAGE DEPENDENT UNIVERSAL PHONEME POSTERIOR ESTIMATION FOR MIXED LANGUAGE SPEECH RECOGNITION
    Imseng, David
    Bourlard, Herve
    Magimai-Doss, Mathew
    Dines, John
    2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2011 : 5012 - 5015
  • [47] Phoneme Classification Using the Auditory Neurogram
    Alam, Md. Shariful
    Zilany, Muhammad S. A.
    Jassim, Wissam A.
    Ahmad, Mohd Yazed
    IEEE ACCESS, 2017, 5 : 633 - 642
  • [48] CROSS-LINGUAL PHONEME MAPPING FOR LANGUAGE ROBUST CONTEXTUAL SPEECH RECOGNITION
    Patel, Ami
    Li, David
    Cho, Eunjoon
    Aleksic, Petar
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018 : 5924 - 5928
  • [49] Deep belief networks for phoneme recognition in continuous Tamil speech-an analysis
    Raguram, Laxmi Sree Baskaran
    Shanmugam, Vijaya Madhaya
    TRAITEMENT DU SIGNAL, 2017, 34 (3-4) : 137 - 151
  • [50] AN IMPROVED METHOD FOR SPEECH/SPEAKER RECOGNITION
    Gaafar, Tamer S.
    Bakr, Hitham M. Abo
    Abdalla, Mahmoud I.
    2014 INTERNATIONAL CONFERENCE ON INFORMATICS, ELECTRONICS & VISION (ICIEV), 2014