End-to-end speech recognition modeling from de-identified data

Cited by: 3
Authors
Flechl, Martin [1 ]
Yin, Shou-Chun [1 ]
Park, Junho [1 ]
Skala, Peter [1 ]
Affiliations
[1] Nuance Communications Inc., Burlington, MA 01803, USA
Source
INTERSPEECH 2022 | 2022
Keywords
speech recognition; ASR; end-to-end; de-identification; privacy; conformer; transducer; text-to-speech;
DOI
10.21437/Interspeech.2022-10484
CLC Classification: O42 [Acoustics]
Subject Classification: 070206; 082403
Abstract
De-identification of data used for automatic speech recognition modeling is a critical component in protecting privacy, especially in the medical domain. However, simply removing all personally identifiable information (PII) from end-to-end model training data leads to significant performance degradation, in particular for the recognition of names, dates, locations, and words from similar categories. We propose and evaluate a two-step method for partially recovering this loss. First, PII is identified, and each occurrence is replaced with a random word sequence of the same category. Then, corresponding audio is produced via text-to-speech or by splicing together matching audio fragments extracted from the corpus. These artificial audio/label pairs, together with speaker turns from the original data without PII, are used to train models. We evaluate this method on in-house data of medical conversations and observe a recovery of almost the entire degradation in overall word error rate while maintaining strong diarization performance. Our main focus is improving recall and precision in the recognition of PII-related words: depending on the PII category, between 50% and 90% of the performance degradation can be recovered with the proposed method.
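The first step of the method, replacing each identified PII span with a random word sequence of the same category, can be sketched as follows. This is an illustrative sketch only: the category names, surrogate lists, and the (word, tag) transcript format are hypothetical, and the paper's actual PII tagger is not specified here.

```python
import random

# Hypothetical category-to-surrogate lists; a real system would draw from
# much larger, category-matched vocabularies.
SURROGATES = {
    "NAME": ["mary jones", "tom baker"],
    "DATE": ["march third", "july twelfth"],
    "LOCATION": ["springfield", "lakeview"],
}

def replace_pii(tokens):
    """Replace each contiguous span of PII-tagged words with a random
    surrogate word sequence of the same category; keep other words as-is.

    `tokens` is a list of (word, tag) pairs, where tag is a PII category
    name or None for non-PII words.
    """
    out, i = [], 0
    while i < len(tokens):
        word, tag = tokens[i]
        if tag in SURROGATES:
            # Consume the whole contiguous span with this category so a
            # multi-word name is replaced by one surrogate, not several.
            while i < len(tokens) and tokens[i][1] == tag:
                i += 1
            out.extend(random.choice(SURROGATES[tag]).split())
        else:
            out.append(word)
            i += 1
    return out

transcript = [("patient", None), ("john", "NAME"), ("smith", "NAME"),
              ("visited", None), ("on", None), ("may", "DATE"), ("first", "DATE")]
print(" ".join(replace_pii(transcript)))
```

The resulting de-identified label sequence would then be paired with synthesized or spliced audio (the paper's second step) to form artificial training pairs.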
Pages: 1382-1386 (5 pages)