Recent Progress in the CUHK Dysarthric Speech Recognition System

Cited by: 39
Authors
Liu, Shansong [1 ]
Geng, Mengzhe [1 ]
Hu, Shoukang [1 ]
Xie, Xurong [1 ,2 ]
Cui, Mingyu [1 ]
Yu, Jianwei [1 ]
Liu, Xunying [1 ]
Meng, Helen [1 ]
Affiliations
[1] Chinese Univ Hong Kong, Hong Kong 999077, Peoples R China
[2] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen 100049, Peoples R China
Keywords
Speech recognition; Task analysis; Speech processing; Visualization; Data models; Adaptation models; Phonetics; Disordered speech recognition; speaker adaptation; data augmentation; multimodal speech recognition; data augmentation method; neural networks; features; speakers
DOI
10.1109/TASLP.2021.3091805
CLC number
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Despite the rapid progress of automatic speech recognition (ASR) technologies in the past few decades, recognition of disordered speech remains a highly challenging task to date. Disordered speech presents a wide spectrum of challenges to current data-intensive deep neural network (DNN) based ASR technologies, which predominantly target normal speech. This paper presents recent research efforts at the Chinese University of Hong Kong (CUHK) to improve the performance of disordered speech recognition systems on the largest publicly available UASpeech dysarthric speech corpus. A set of novel modelling techniques, including neural architecture search, data augmentation using spectro-temporal perturbation, model-based speaker adaptation, and cross-domain generation of visual features within an audio-visual speech recognition (AVSR) framework, were employed to address the above challenges. The combination of these techniques produced the lowest published word error rate (WER) of 25.21% on the UASpeech test set of 16 dysarthric speakers, and an overall WER reduction of 5.4% absolute (17.6% relative) over the CUHK 2018 dysarthric speech recognition system, which featured a 6-way DNN system combination and cross adaptation of systems trained on out-of-domain normal speech data. Bayesian model adaptation further allows rapid adaptation to individual dysarthric speakers using as little as 3.06 seconds of speech. The efficacy of these techniques was further demonstrated on the CUDYS Cantonese dysarthric speech recognition task.
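The abstract's spectro-temporal perturbation for data augmentation can be illustrated with a minimal SpecAugment-style sketch: random frequency and time masks applied to a log-mel spectrogram. This is a hypothetical simplification for intuition only, not the authors' exact perturbation operators; all function and parameter names here are assumptions.

```python
import numpy as np

def spectro_temporal_perturb(spec, max_freq_mask=8, max_time_mask=20, rng=None):
    """Illustrative spectro-temporal perturbation (SpecAugment-style sketch).

    `spec` is a (freq_bins, time_frames) log-mel spectrogram. One random
    band of mel bins and one random span of frames are zeroed out; the
    paper's actual augmentation pipeline may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    n_freq, n_time = out.shape
    # Frequency mask: zero a random band of consecutive mel bins.
    f = int(rng.integers(0, max_freq_mask + 1))
    f0 = int(rng.integers(0, max(1, n_freq - f)))
    out[f0:f0 + f, :] = 0.0
    # Time mask: zero a random span of consecutive frames.
    t = int(rng.integers(0, max_time_mask + 1))
    t0 = int(rng.integers(0, max(1, n_time - t)))
    out[:, t0:t0 + t] = 0.0
    return out
```

In practice such masks would be applied on the fly during training, alongside tempo or speed perturbation of the waveform, to enlarge the limited dysarthric training data.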
Pages: 2267-2281 (15 pages)