A New Corpus of Elderly Japanese Speech for Acoustic Modeling, and a Preliminary Investigation of Dialect-Dependent Speech Recognition

Cited by: 0
Authors
Fukuda, Meiko [1 ]
Nishimura, Ryota [1 ]
Nishizaki, Hiromitsu [2 ]
Iribe, Yurie [3 ]
Kitaoka, Norihide [4 ]
Affiliations
[1] Tokushima Univ, Dept Comp Sci, Tokushima, Japan
[2] Univ Yamanashi, Fac Engn, Grad Sch Interdisciplinary Res, Kofu, Yamanashi, Japan
[3] Aichi Prefectural Univ, Sch Informat Sci & Technol, Nagakute, Aichi, Japan
[4] Toyohashi Univ Technol, Dept Comp Sci & Engn, Toyohashi, Aichi, Japan
Source
2019 22ND CONFERENCE OF THE ORIENTAL COCOSDA INTERNATIONAL COMMITTEE FOR THE CO-ORDINATION AND STANDARDISATION OF SPEECH DATABASES AND ASSESSMENT TECHNIQUES (O-COCOSDA) | 2019
Keywords
elderly; Japanese; corpus; speech recognition; adaptation; dialect; dementia
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
We have constructed a new speech data corpus consisting of the utterances of 221 elderly Japanese speakers (average age: 79.2) with the aim of improving the accuracy of automatic speech recognition (ASR) for the elderly. ASR is a beneficial modality for people with impaired vision or limited hand movement, including the elderly. However, speech recognition systems built on standard recognition models, especially acoustic models, have been unable to achieve satisfactory performance for elderly speakers. Thus, building more accurate acoustic models of elderly speech is essential for improving speech recognition for the elderly. Using our new corpus, which includes the speech of elderly people living in three regions of Japan, we conducted speech recognition experiments with a variety of DNN-HMM acoustic models. As training data for these acoustic models, we examined whether a standard adult Japanese speech corpus (JNAS), an elderly speech corpus (S-JNAS), or a spontaneous speech corpus (CSJ) was most suitable, and whether adaptation to the dialect of each region improved recognition results. We adapted each of the three acoustic models to our entire speech corpus, and then re-adapted them using the speech from each region. Without adaptation, the best recognition results were obtained with the S-JNAS-trained acoustic models (entire corpus: 21.85% word error rate, WER). However, after adaptation to our entire corpus, the CSJ-trained models achieved the lowest WER (entire corpus: 17.42%). Moreover, after re-adaptation to each regional dialect, the CSJ-trained acoustic models adapted to regional speech data showed a tendency toward further improvement in recognition rates. We plan to collect more utterances from all over Japan, so that our corpus can serve as a key resource for elderly Japanese speech recognition, and we hope to achieve further improvements in recognition performance for elderly speech.
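As an illustration of the two-stage adaptation scheme described in the abstract (a pre-trained acoustic model is first fine-tuned on the whole elderly corpus, then re-adapted on one region's speech), the following is a minimal, hypothetical PyTorch-style sketch. It is not the authors' implementation (DNN-HMM systems of this kind are typically built in toolkits such as Kaldi), and the model and data-loader names are placeholders.

    import torch
    import torch.nn as nn

    def fine_tune(model, loader, epochs, lr):
        """Cross-entropy fine-tuning of a frame-level DNN senone classifier."""
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        model.train()
        for _ in range(epochs):
            for feats, senones in loader:  # feats: (batch, dim) float, senones: (batch,) long
                optimizer.zero_grad()
                loss = criterion(model(feats), senones)
                loss.backward()
                optimizer.step()
        return model

    # Hypothetical usage mirroring the two-stage procedure in the abstract:
    # model = load_pretrained_csj_dnn()                          # placeholder: base model trained on CSJ
    # model = fine_tune(model, elderly_corpus_loader, 5, 1e-3)   # stage 1: adapt to the entire elderly corpus
    # model = fine_tune(model, regional_loader, 3, 1e-4)         # stage 2: re-adapt to one region's dialect

A smaller learning rate in the second stage is a common choice to avoid overwriting the corpus-level adaptation with the much smaller regional data set; the paper's abstract does not specify its adaptation hyperparameters.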
Pages: 78-83
Number of pages: 6