Customized deep learning based Turkish automatic speech recognition system supported by language model

Cited by: 2

Authors:

Gormez, Yasin [1]

Affiliations:

[1] Sivas Cumhuriyet University, Management Information Systems, Sivas, Merkez, Türkiye

Source:

PeerJ Computer Science | 2024 / Volume 10

Keywords:

Automatic speech recognition; Deep learning; Turkish speech recognition; Machine learning; Word normalization; Sequence to sequence model

DOI:

10.7717/peerj-cs.1981

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Background. Automatic speech recognition is built into many applications that people use in daily life, so a successful automatic speech recognition system can make everyday routines considerably more convenient. While many such systems have been established for widely spoken languages like English, progress has been insufficient for less common languages such as Turkish. Moreover, because Turkish is agglutinative, designing a speech recognition system for it is more challenging than for other language groups. Therefore, our study focused on proposing deep learning models for automatic speech recognition in Turkish, complemented by the integration of a language model.

Methods. In our study, deep learning models were built from convolutional neural networks, gated recurrent units, long short-term memories, and transformer layers. The Zemberek library was employed to craft the language model to improve system performance. Furthermore, Bayesian optimization was applied to tune the hyper-parameters of the deep learning models. To evaluate model performance, the standard metrics for automatic speech recognition systems, word error rate and character error rate, were employed.

Results. With optimal hyper-parameters applied to the models developed with various layers, the scores are as follows. Without a language model, the Turkish Microphone Speech Corpus dataset yields a word error rate of 22.2 and a character error rate of 14.05, while the Turkish Speech Corpus dataset yields a word error rate of 11.5 and a character error rate of 4.15. Incorporating the language model brought notable improvements. For the Turkish Microphone Speech Corpus dataset, the word error rate decreased to 9.85 and the character error rate to 5.35. Similarly, for the Turkish Speech Corpus dataset, the word error rate improved to 8.4 and the character error rate decreased to 2.7. These results demonstrate that our model outperforms the studies found in the existing literature.
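The abstract reports word error rate (WER) and character error rate (CER) but does not define them. As a minimal sketch of the commonly used definitions (not the authors' evaluation code), both can be computed as the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The function names, the percentage scaling, and the choice to count spaces as characters in CER are assumptions made here for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent, computed over whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER in percent, computed over characters (spaces included; an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "ben eve erken geldim"  # hypothetical reference transcript
    hyp = "ben eve erken geldi"   # hypothetical recognizer output
    print(word_error_rate(ref, hyp))       # 25.0 (1 of 4 words wrong)
    print(character_error_rate(ref, hyp))  # 5.0 (1 of 20 characters wrong)
```

The example hints at why the two metrics diverge for an agglutinative language: a single dropped suffix character counts as a whole word error, so WER can be much higher than CER on the same output.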

Pages: 22
