Automated Speech Audiometry: Can It Work Using Open-Source Pre-Trained Kaldi-NL Automatic Speech Recognition?

Cited by: 4
Authors
Araiza-Illan, Gloria [1, 2]
Meyer, Luke [1, 2]
Truong, Khiet P. [3]
Baskent, Deniz [1, 2]
Affiliations
[1] Univ Groningen, Univ Med Ctr Groningen, Dept Otorhinolaryngol Head & Neck Surg, Groningen, Netherlands
[2] Univ Groningen, Univ Med Ctr Groningen, WJ Kolff Inst Biomed Engn & Mat Sci, Groningen, Netherlands
[3] Univ Twente, Human Media Interact, Enschede, Netherlands
Keywords
speech audiometry; speech perception; automatic speech recognition; speech-in-noise hearing test; digits-in-noise test; noise; intelligibility; threshold; listeners; hearing
DOI
10.1177/23312165241229057
Chinese Library Classification (CLC) numbers
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
The digits-in-noise (DIN) test is a practical speech audiometry tool for hearing screening in populations of varying ages and hearing status. The test is usually conducted either by a human supervisor (e.g., a clinician), who scores the responses spoken by the listener, or online, where software scores the responses entered by the listener. The test consists of 24 digit triplets presented in an adaptive staircase procedure, yielding a speech reception threshold (SRT). We propose an alternative automated DIN test setup that can evaluate spoken responses without a human supervisor, using the open-source automatic speech recognition toolkit Kaldi-NL. Thirty self-reported normal-hearing Dutch adults (19-64 years) each completed one DIN + Kaldi-NL test. Their spoken responses were recorded and used to evaluate the transcripts of the responses decoded by Kaldi-NL. Study 1 evaluated Kaldi-NL performance through its word error rate (WER), defined here as the summed decoding errors involving digits in the transcript, expressed as a percentage of the total number of digits in the spoken responses. The average WER across participants was 5.0% (range 0-48%, SD = 8.8%), with decoding errors in three triplets per participant on average. Study 2 used bootstrapping simulations to analyze how triplets with Kaldi-NL decoding errors affected the DIN test output (SRT). Previous research has reported 0.70 dB as the typical within-subject SRT variability for normal-hearing adults. Study 2 showed that up to four triplets with decoding errors produce SRT variations within this range, suggesting that the proposed setup could be feasible for clinical applications.
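As an illustration of the Study 1 metric, the following Python sketch computes a digit-only WER: it keeps the digit words in each Kaldi-NL transcript, aligns them against the digits the listener actually spoke using Levenshtein distance (substitutions + deletions + insertions), and divides the summed errors by the total number of spoken digits. The Dutch digit vocabulary, the whitespace tokenization, and the helper names (`digits_in_transcript`, `edit_distance`, `digit_wer`) are assumptions for illustration, not the authors' implementation or part of Kaldi-NL.

```python
from typing import Sequence

# Hypothetical mapping from Dutch digit words to digits; the exact vocabulary
# and spelling variants emitted by Kaldi-NL are assumptions.
DUTCH_DIGITS = {
    "nul": "0", "een": "1", "één": "1", "twee": "2", "drie": "3",
    "vier": "4", "vijf": "5", "zes": "6", "zeven": "7", "acht": "8",
    "negen": "9",
}


def digits_in_transcript(transcript: str) -> list[str]:
    """Keep only the digit words of a decoded transcript, mapped to digits."""
    return [DUTCH_DIGITS[w] for w in transcript.lower().split() if w in DUTCH_DIGITS]


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion (spoken digit missing)
                curr[j - 1] + 1,         # insertion (extra decoded digit)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def digit_wer(spoken: Sequence[Sequence[str]], transcripts: Sequence[str]) -> float:
    """Summed digit errors over all triplets, as a % of the digits actually spoken."""
    if len(spoken) != len(transcripts):
        raise ValueError("need one transcript per spoken response")
    errors = sum(edit_distance(ref, digits_in_transcript(hyp))
                 for ref, hyp in zip(spoken, transcripts))
    total = sum(len(ref) for ref in spoken)
    return 100.0 * errors / total


if __name__ == "__main__":
    spoken = [["3", "5", "8"], ["1", "2", "9"]]
    decoded = ["uh drie vijf acht", "een twee"]  # second transcript drops "negen"
    print(f"{digit_wer(spoken, decoded):.1f}%")
```

In this example one digit is missing out of six spoken digits, so the sketch reports about 16.7%. Because "een" is also the Dutch indefinite article, a real scorer would have to resolve that ambiguity; the sketch simply treats every "een" as the digit 1.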
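The Study 2 question, namely how many mis-scored triplets an adaptive track can absorb, can be explored with a simulated-listener Monte Carlo sketch in Python. This is not the authors' bootstrapping procedure on recorded data; it swaps in a hypothetical logistic psychometric function and assumes a 1-up/1-down rule with 2 dB steps, a 0 dB starting SNR, and an SRT taken as the mean SNR of triplets 5 to 25 (the 25th being the SNR that would have been presented next). Each run draws one set of random numbers and replays the same simulated listener with and without `n_errors` flipped scores, so the SRT difference isolates the effect of the scoring errors. All names and parameter values (`run_din`, `srt_shift`, `STEP_DB`, `START_SNR`, the slope and SRT distribution) are illustrative assumptions.

```python
import math
import random
from statistics import mean

N_TRIPLETS = 24   # triplets per DIN test (from the abstract)
STEP_DB = 2.0     # assumed staircase step size in dB (hypothetical)
START_SNR = 0.0   # assumed starting SNR in dB (hypothetical)


def p_correct(snr: float, true_srt: float, slope: float = 1.0) -> float:
    """Hypothetical logistic psychometric function for a whole triplet."""
    return 1.0 / (1.0 + math.exp(-slope * (snr - true_srt)))


def run_din(uniforms: list[float], true_srt: float, flipped: set[int]) -> float:
    """Run a 1-up/1-down track; scores at indices in `flipped` are inverted."""
    snr, track = START_SNR, []
    for t in range(N_TRIPLETS):
        track.append(snr)
        correct = uniforms[t] < p_correct(snr, true_srt)
        if t in flipped:  # simulated ASR decoding error flips the score
            correct = not correct
        snr += -STEP_DB if correct else STEP_DB
    track.append(snr)      # SNR the 25th triplet would have had
    return mean(track[4:])  # assumed rule: mean SNR of triplets 5-25


def srt_shift(n_errors: int, n_runs: int = 2000, seed: int = 1) -> float:
    """Mean absolute SRT change caused by n_errors mis-scored triplets."""
    rng = random.Random(seed)
    shifts = []
    for _ in range(n_runs):
        true_srt = rng.gauss(-8.0, 1.0)  # hypothetical listener SRT distribution
        u = [rng.random() for _ in range(N_TRIPLETS)]  # shared listener randomness
        errors = set(rng.sample(range(N_TRIPLETS), n_errors))
        shifts.append(abs(run_din(u, true_srt, errors) - run_din(u, true_srt, set())))
    return mean(shifts)


if __name__ == "__main__":
    for k in range(6):
        print(k, round(srt_shift(k), 2))
```

Comparing `srt_shift(k)` for increasing k with the 0.70 dB within-subject variability cited in the abstract gives a rough, model-dependent counterpart to the Study 2 analysis; the resulting numbers depend entirely on the assumed step size, scoring rule, and psychometric function.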
Pages: 13