Automatic phoneme recognition by deep neural networks

Cited by: 0

Authors:

Pereira, Bianca Valeria L. [1]

de Carvalho, Mateus B. F. [1]

Alves, Pedro Augusto A. da S. de A. Nava [1]

Ribeiro, Paulo Rogerio de A. [1]

de Oliveira, Alexandre Cesar M. [1]

de Almeida Neto, Areolino [1]

Affiliations:

[1] Fed Univ Maranhao UFMA, Ave Portugueses 1966 Vila Bacanga, BR-65080805 Sao Luis, Maranhao, Brazil

Source:

JOURNAL OF SUPERCOMPUTING | 2024 / Vol. 80 / Issue 11

Keywords:

Speech recognition; Deep neural networks; Computer vision; Deep learning;

DOI:

10.1007/s11227-024-06098-6

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

This work presents a lightweight phoneme recognition model based on object detection techniques. The model is designed mainly to run on devices with low processing power, such as tablets and mobile phones. Hardware-aware network architecture search, complemented by the NetAdapt algorithm, motivated the choice of a simpler and lighter network architecture, MobileNet. The MobileNetV3 convolutional network architecture was combined with the Single-Shot Detector (SSD). The model was trained on the TIMIT and LibriSpeech databases, both of which contain spoken English audio. To obtain a graphical representation of each audio recording, its spectrogram was computed on the Mel scale. To train the phoneme location detector, the temporal position of each phoneme occurrence in the corresponding spectrogram was used. The training dataset also had to be enlarged to improve the generalization of the model, so the two databases were merged and data augmentation techniques were applied to the audio. The main goal was to achieve learning with a lightweight architecture that can be deployed on such low-power devices. This research therefore used the MobileNet-Large architecture, which obtained a score of 0.72 mAP@0.5IOU. For comparison, the MobileNet-Small architecture was also used, which obtained a score of 0.63 mAP@0.5IOU.
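The abstract describes two steps that a short sketch may make more concrete: turning the temporal position of a phoneme into a box on the spectrogram, and judging a detection at the 0.5 IoU threshold behind the mAP@0.5IOU figures. The Python below is only an illustration under stated assumptions, not the authors' implementation. The names `Box`, `interval_to_box`, `temporal_iou` and `is_true_positive` are hypothetical, as is the hop length of 160 samples per spectrogram frame. It also assumes that every box spans the full frequency axis, in which case the two-dimensional IoU reduces to the overlap along the time axis.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A phoneme box on a spectrogram, described along the time axis only."""
    label: str
    x_min: float  # left edge, in spectrogram frames
    x_max: float  # right edge, in spectrogram frames


def interval_to_box(label: str, start_sample: int, end_sample: int,
                    hop_length: int = 160) -> Box:
    """Map a phoneme interval given in audio samples to frame columns.

    hop_length is an assumed value (samples per spectrogram frame).
    """
    return Box(label, start_sample / hop_length, end_sample / hop_length)


def temporal_iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes along the time axis."""
    inter = max(0.0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))
    union = (a.x_max - a.x_min) + (b.x_max - b.x_min) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A detection counts when the label matches and IoU meets the threshold."""
    return pred.label == truth.label and temporal_iou(pred, truth) >= threshold


# Illustrative values only: a ground-truth phoneme and a slightly shifted guess.
truth = interval_to_box("ae", 3200, 4800)   # frames 20.0 to 30.0
pred = Box("ae", 22.0, 32.0)
print(round(temporal_iou(pred, truth), 3), is_true_positive(pred, truth))
```

In this example the two boxes overlap for 8 frames out of a combined 12, so the IoU is about 0.667 and the detection would count as a true positive at the 0.5 threshold. Mean average precision then averages, over phoneme classes, the precision obtained as detections are ranked by confidence.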

Citation:

Pages: 16654-16678

Page count: 25

References: 43 in total

[1] Towards Deep Object Detection Techniques for Phoneme Recognition [J].

Algabri, Mohammed ;

Mathkour, Hassan ;

Bencherif, Mohamed Abdelkader ;

Alsulaiman, Mansour ;

Mekhtiche, Mohamed Amine .

IEEE ACCESS, 2020, 8 :54663-54680

[2]

Bresolin AdA, 2008, THESIS FEDERAL U RIO

[3]

Coniam D., 1999, System, V27, P49, DOI 10.1016/S0346-251X(98)00049-9

[4]

Dai W, 2017, INT CONF ACOUST SPEE, P421, DOI 10.1109/ICASSP.2017.7952190

[5]

Deng J, 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848

[6] Scalable Object Detection using Deep Neural Networks [J].

Erhan, Dumitru ;

Szegedy, Christian ;

Toshev, Alexander ;

Anguelov, Dragomir .

2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2014, :2155-2162

[7] The Pascal Visual Object Classes (VOC) Challenge [J].

Everingham, Mark ;

Van Gool, Luc ;

Williams, Christopher K. I. ;

Winn, John ;

Zisserman, Andrew .

INTERNATIONAL JOURNAL OF COMPUTER VISION, 2010, 88 (02) :303-338

[8]

Fan RC, 2018, 2018 INTERNATIONAL CONFERENCE ON AUDIO, LANGUAGE AND IMAGE PROCESSING (ICALIP), P349, DOI 10.1109/ICALIP.2018.8455731

[9] CEPSTRAL ANALYSIS TECHNIQUE FOR AUTOMATIC SPEAKER VERIFICATION [J].

FURUI, S .

IEEE TRANSACTIONS ON ACOUSTICS SPEECH AND SIGNAL PROCESSING, 1981, 29 (02) :254-272

[10] Convolutional Neural Networks for Phoneme Recognition [J].

Glackin, Cornelius ;

Wall, Julie ;

Chollet, Gerard ;

Dugan, Nazim ;

Cannings, Nigel .

PROCEEDINGS OF THE 7TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION APPLICATIONS AND METHODS (ICPRAM 2018), 2018, :190-195
