A Comparative Study on Selecting Acoustic Modeling Units for WFST-based Mongolian Speech Recognition

Cited by: 0

Authors:

Wang, Yonghe [1]

Bao, Feilong [1]

Gao, Guanglai [1]

Affiliations:

[1] Inner Mongolia Univ, Coll Comp Sci, 235 West Coll Rd, Hohhot 010021, Inner Mongolia, Peoples R China

Source:

ACM Transactions on Asian and Low-Resource Language Information Processing | 2023 / Vol. 22 / No. 10

Keywords:

Mongolian; speech recognition; acoustic modeling unit; alignment model; WFST

DOI:

10.1145/3617830

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Traditional weighted finite-state transducer (WFST)-based Mongolian automatic speech recognition (ASR) systems use phonemes as the modeling units of the pronunciation lexicon. However, Mongolian is an agglutinative, low-resource language, and building an ASR system on a phoneme pronunciation lexicon remains challenging for two reasons. First, the phoneme pronunciation lexicon manually constructed by Mongolian linguists is finite, so it is usually used to train a grapheme-to-phoneme (G2P) conversion model that extends the lexicon to new words. However, data sparsity reduces the robustness of the G2P model and degrades the performance of the final ASR system. Second, homophones and polysyllabic words are common in Mongolian, which affects the construction of the Mongolian acoustic model. To address these problems, we first propose a grapheme-to-phoneme alignment model to obtain the mapping between phonemes and subword units. We then construct an acoustic subword segmentation set and use it to segment words directly, instead of using the traditional G2P method to predict phoneme sequences when expanding the pronunciation lexicon. By analyzing the Mongolian encoding form, we also propose a method for constructing acoustic subword modeling units that removes control characters. Finally, we investigate various acoustic subword modeling units for constructing the pronunciation lexicon of the Mongolian ASR system. Experiments on a Mongolian dataset with 325 hours of training data show that a pronunciation lexicon based on acoustic subword modeling units can effectively support a WFST-based Mongolian ASR system, and that removing the control characters when building the acoustic subword modeling units further improves ASR performance.
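
The segmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the set of code points treated as control characters, the greedy longest-match strategy, the function names, and the Latin placeholder subwords are all assumptions made for illustration. The paper's own subword set comes from its alignment model, which is not reproduced here.

```python
# Assumption: these Unicode format characters are the "control characters"
# to strip. The paper may use a different set.
CONTROL_CHARS = {
    "\u180b", "\u180c", "\u180d",  # Mongolian free variation selectors 1-3
    "\u180e",                      # Mongolian vowel separator
    "\u200c", "\u200d",            # zero width non-joiner / joiner
}


def strip_control_chars(word: str) -> str:
    """Return the word with the assumed control characters removed."""
    return "".join(ch for ch in word if ch not in CONTROL_CHARS)


def segment(word: str, subwords: set[str]) -> list[str]:
    """Greedy longest-match segmentation against a subword set.

    Characters not covered by any subword fall back to single-character
    units, so the loop always advances.
    """
    max_len = max([1] + [len(s) for s in subwords])
    units = []
    i = 0
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if size == 1 or piece in subwords:
                units.append(piece)
                i += size
                break
    return units


def lexicon_entry(word: str, subwords: set[str]) -> str:
    """Build one lexicon line: the word followed by its subword units."""
    units = segment(strip_control_chars(word), subwords)
    return f"{word} {' '.join(units)}"


# Hypothetical Latin placeholders standing in for Mongolian subwords.
demo_subwords = {"sai", "ba", "na"}
demo_units = segment(strip_control_chars("sain\u180bbana"), demo_subwords)
```

Because each lexicon line is derived from the written form alone, a new word can be added without running a G2P model, which is the property the abstract emphasizes.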


Pages: 20
