Automatic phoneme recognition by deep neural networks

Cited by: 0
Authors
Pereira, Bianca Valeria L. [1]
de Carvalho, Mateus B. F. [1]
Alves, Pedro Augusto A. da S. de A. Nava [1]
Ribeiro, Paulo Rogerio de A. [1]
de Oliveira, Alexandre Cesar M. [1]
de Almeida Neto, Areolino [1]
Affiliations
[1] Fed Univ Maranhao UFMA, Ave Portugueses 1966 Vila Bacanga, BR-65080805 Sao Luis, Maranhao, Brazil
Keywords
Speech recognition; Deep neural networks; Computer vision; Deep learning;
DOI
10.1007/s11227-024-06098-6
CLC classification number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
This work presents a lightweight phoneme recognition model based on object detection techniques. The model is intended to run on devices with low processing power, such as tablets and mobile phones. Hardware-aware network architecture search, complemented by the NetAdapt algorithm, has produced a simpler and lighter family of network architectures called MobileNet. Here, the MobileNetV3 convolutional architecture was combined with the Single-Shot Detection (SSD) framework. The model was trained on TIMIT and LibriSpeech, both of which contain spoken English audio. To obtain a graphical representation of the audio databases, a Mel-scale spectrogram was computed for each audio file. The phoneme location detector is trained on the temporal position at which each phoneme occurs in its spectrogram. To improve the generalization of the model, the training set had to be enlarged, so the two databases were merged and data augmentation techniques were applied to the audio. The MobileNet-Large architecture obtained 0.72 mAP@0.5IOU. For comparison, the MobileNet-Small architecture was also trained and obtained 0.63 mAP@0.5IOU.
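As a rough illustration of the labeling and scoring idea in the abstract, the Python sketch below turns a phoneme's time span into a bounding box over a spectrogram and checks a predicted box against the 0.5 IoU threshold. TIMIT phone annotations give start and end positions in samples at 16 kHz; the hop length, the number of Mel bands, the full-height boxes, and every name in the code are assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass

SAMPLE_RATE = 16_000  # TIMIT audio is sampled at 16 kHz
HOP_LENGTH = 160      # assumed hop of 10 ms; not stated in the abstract
N_MELS = 128          # assumed number of Mel bands; not stated in the abstract


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in spectrogram coordinates (frames x Mel bins)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def phoneme_to_box(start_sample: int, end_sample: int, label: str) -> Box:
    """Map a phoneme's sample range to a box spanning the full Mel axis.

    Using the full spectrogram height is an assumption of this sketch.
    """
    return Box(label,
               start_sample / HOP_LENGTH, 0.0,
               end_sample / HOP_LENGTH, float(N_MELS))


def area(box: Box) -> float:
    return (box.x_max - box.x_min) * (box.y_max - box.y_min)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    width = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    height = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if width <= 0 or height <= 0:
        return 0.0
    intersection = width * height
    return intersection / (area(a) + area(b) - intersection)


def is_true_positive(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A detection counts at IoU 0.5 if the label matches and IoU >= threshold."""
    return pred.label == truth.label and iou(pred, truth) >= threshold


# Hypothetical phoneme "sh" occupying samples 3200..4800 (frames 20..30).
truth = phoneme_to_box(3200, 4800, "sh")
good_pred = Box("sh", 22.0, 0.0, 32.0, float(N_MELS))  # overlaps 8 of 12 frames, IoU = 2/3
poor_pred = Box("sh", 27.0, 0.0, 37.0, float(N_MELS))  # overlaps only 3 frames
```

With these assumed settings, `is_true_positive(good_pred, truth)` holds and `is_true_positive(poor_pred, truth)` does not. A metric such as mAP@0.5IOU aggregates decisions of this kind into per-class average precision and then averages over classes.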
Pages: 16654-16678
Page count: 25