Towards Deep Object Detection Techniques for Phoneme Recognition

Cited by: 22

Authors:

Algabri, Mohammed [1,3]

Mathkour, Hassan [1,3]

Bencherif, Mohamed Abdelkader [2,3]

Alsulaiman, Mansour [2,3]

Mekhtiche, Mohamed Amine [2,3]

Affiliations:

[1] King Saud Univ, Comp Sci Dept, Coll Comp & Informat Sci, Riyadh 11543, Saudi Arabia

[2] King Saud Univ, Comp Engn Dept, Coll Comp & Informat Sci, Riyadh 11543, Saudi Arabia

[3] King Saud Univ, CS2R, Riyadh 11543, Saudi Arabia

Source:

IEEE Access | 2020 / Volume 8

Keywords:

Object detection; Detectors; Speech recognition; Hidden Markov models; Task analysis; Acoustics; Machine learning; CenterNet; object detection; phoneme recognition; transfer learning; YOLO; CONVOLUTIONAL NEURAL-NETWORK; SPEECH; FEATURES; MODEL;

DOI:

10.1109/ACCESS.2020.2980452

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

This study investigates the use of cutting-edge object detection techniques to build an accurate phoneme sequence recognition system for English and Arabic. Numerous deep learning techniques have recently been proposed for object detection in everyday applications. In this paper, we propose applying object detection techniques to speech processing tasks. We selected two state-of-the-art object detectors, YOLO and CenterNet, based on a trade-off between detection accuracy and speed. We tackled phoneme sequence recognition using three systems: a domain transfer learning system (DTS) from image to speech, an intra-language transfer learning system (IaTS) between speech corpora within the same language (English to English), and an inter-language transfer learning system (IeTS) between speech corpora from dissimilar languages (English to Arabic). For English phoneme recognition, the Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus is used to evaluate the proposed systems. Our IaTS based on the CenterNet detector achieves the best result on the TIMIT core test set, with a 15.89% phone error rate (PER). For Arabic phoneme recognition, the best performance, a 7.58% PER, was achieved using CenterNet. These results show the effectiveness of object detection techniques in phoneme recognition tasks. Furthermore, the findings suggest that speech processing tasks may be treated as object detection tasks.


Pages: 54663-54680

Page count: 18

