English Speech Recognition System Based on Long Short Term Memory Algorithm

Times Cited: 0
Authors
Qian, Yuanyuan [1]
Affiliations
[1] Changchun Univ Architecture & Civil Engn, Changchun, Peoples R China
Source
2024 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER APPLICATIONS, ICAICA, 2024
Keywords
Speech recognition; Deep learning; Long short-term memory network; Language model
DOI
10.1109/ICAICA63239.2024.10823023
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
This paper presents the design and implementation of an English speech recognition system based on the Long Short-Term Memory (LSTM) algorithm. LSTM, a variant of the recurrent neural network (RNN), is widely used in deep learning because of its ability to model long-term dependencies, which makes it well suited to speech recognition tasks. The system adopts an end-to-end architecture: a front-end signal processing module captures the key characteristics of the audio signal using feature extraction techniques such as Mel Frequency Cepstral Coefficients (MFCCs), and the LSTM model is then trained on these features to produce accurate transcriptions of the input speech. During model training, a large English speech dataset, including the standard TIMIT pronunciation dictionary, was used to ensure that the model generalizes well across different accents and pronunciation conditions. Experimental results show that the error rate of the proposed LSTM-based speech recognition system is significantly lower than that of traditional HMM-based methods on various test sets, with particularly strong robustness on recordings containing background noise. The system also demonstrates good real-time processing capability, making it suitable for applications such as intelligent voice assistants and telephone customer service systems. However, it still has limitations in handling non-standard English pronunciation and mixed-language scenarios; future work will address these problems to further improve the system's flexibility and accuracy.
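To make the MFCC-plus-LSTM pipeline described in the abstract concrete, the following is a minimal sketch in Python. The paper does not specify an implementation framework; librosa for feature extraction, PyTorch for the acoustic model, and all hyperparameters (13 MFCC coefficients, two LSTM layers, a TIMIT-style phone label set) are assumptions made for this illustration, not details reported by the authors.

# Minimal sketch of the described pipeline: MFCC front-end + LSTM acoustic model.
# Framework, hyperparameters, and label set are assumptions, not from the paper.
import librosa
import torch
import torch.nn as nn

def extract_mfcc(wav_path, n_mfcc=13, sr=16000):
    """Load an audio file and return a (frames, n_mfcc) MFCC feature matrix."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return torch.from_numpy(mfcc.T).float()                      # (frames, n_mfcc)

class LSTMAcousticModel(nn.Module):
    """Frame-level LSTM mapping MFCC frames to per-frame label scores."""
    def __init__(self, n_mfcc=13, hidden=256, n_labels=62):  # e.g. 61 TIMIT phones + blank
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_labels)

    def forward(self, x):                 # x: (batch, frames, n_mfcc)
        out, _ = self.lstm(x)             # (batch, frames, hidden)
        return self.proj(out)             # (batch, frames, n_labels)

# Example usage: score one utterance (log-probabilities suitable for a CTC-style loss).
# feats = extract_mfcc("sample.wav").unsqueeze(0)      # (1, frames, 13)
# model = LSTMAcousticModel()
# log_probs = model(feats).log_softmax(dim=-1)

In an end-to-end setup of this kind, the per-frame outputs would typically be trained with a sequence loss (e.g. CTC) and decoded with a language model; the specific training objective used by the authors is not stated in the abstract.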
Pages: 229-233
Number of Pages: 5