End-to-end speech recognition modeling from de-identified data

Cited by: 3
Authors
Flechl, Martin [1 ]
Yin, Shou-Chun [1 ]
Park, Junho [1 ]
Skala, Peter [1 ]
Affiliations
[1] Nuance Communications Inc., Burlington, MA 01803, USA
Source
INTERSPEECH 2022 | 2022
Keywords
speech recognition; ASR; end-to-end; de-identification; privacy; conformer; transducer; text-to-speech;
DOI
10.21437/Interspeech.2022-10484
CLC Classification: O42 [Acoustics]
Subject Classification: 070206; 082403
Abstract
De-identification of data used for automatic speech recognition modeling is a critical component in protecting privacy, especially in the medical domain. However, simply removing all personally identifiable information (PII) from end-to-end model training data leads to significant performance degradation, in particular for the recognition of names, dates, locations, and words from similar categories. We propose and evaluate a two-step method for partially recovering this loss. First, PII is identified, and each occurrence is replaced with a random word sequence of the same category. Then, corresponding audio is produced via text-to-speech or by splicing together matching audio fragments extracted from the corpus. These artificial audio/label pairs, together with speaker turns from the original data without PII, are used to train models. We evaluate this method on in-house data of medical conversations and observe a recovery of almost the entire degradation in overall word error rate while maintaining strong diarization performance. Our main focus is improving recall and precision in the recognition of PII-related words: depending on the PII category, between 50% and 90% of the performance degradation can be recovered with the proposed method.
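The first step of the method, replacing each identified PII span with a random word sequence of the same category, can be sketched as follows. This is an illustrative sketch only: the category names, surrogate lists, and the (word, tag) transcript format are hypothetical, and the paper's actual PII tagger is not specified here.

```python
import random

# Hypothetical category-to-surrogate lists; a real system would draw from
# much larger, category-matched vocabularies.
SURROGATES = {
    "NAME": ["mary jones", "tom baker"],
    "DATE": ["march third", "july twelfth"],
    "LOCATION": ["springfield", "lakeview"],
}

def replace_pii(tokens):
    """Replace each contiguous span of PII-tagged words with a random
    surrogate word sequence of the same category; keep other words as-is.

    `tokens` is a list of (word, tag) pairs, where tag is a PII category
    name or None for non-PII words.
    """
    out, i = [], 0
    while i < len(tokens):
        word, tag = tokens[i]
        if tag in SURROGATES:
            # Consume the whole contiguous span with this category so a
            # multi-word name is replaced by one surrogate, not several.
            while i < len(tokens) and tokens[i][1] == tag:
                i += 1
            out.extend(random.choice(SURROGATES[tag]).split())
        else:
            out.append(word)
            i += 1
    return out

transcript = [("patient", None), ("john", "NAME"), ("smith", "NAME"),
              ("visited", None), ("on", None), ("may", "DATE"), ("first", "DATE")]
print(" ".join(replace_pii(transcript)))
```

The resulting de-identified label sequence would then be paired with synthesized or spliced audio (the paper's second step) to form artificial training pairs.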
Pages: 1382-1386 (5 pages)