Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition
Cited by: 1
Authors:
Ravenscroft, William [1,2]
Close, George [1]
Goetze, Stefan [1]
Hain, Thomas [1]
Soleymanpour, Mohammad [2]
Chowdhury, Anurag [2]
Fuhs, Mark C. [2]
Affiliations:
[1] Univ Sheffield, Dept Comp Sci, Sheffield, S Yorkshire, England
[2] Solventum, St Paul, MN USA
Abstract:
One solution to automatic speech recognition (ASR) of overlapping speakers is to separate the speech and then perform ASR on the separated signals. However, the separator commonly produces artefacts that degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks, which is often not viable for training on real-world in-domain audio, where reference transcriptions are not always available. This paper proposes a transcription-free method for joint training that uses only audio signals. The proposed method uses the embedding differences of pre-trained ASR encoders as a loss, together with a proposed modification to permutation invariant training (PIT) called guided PIT (GPIT). The method achieves a 6.4% improvement in word error rate (WER) over a signal-level loss and also improves enhancement quality in perceptual measures such as short-time objective intelligibility (STOI).
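The abstract builds on standard permutation invariant training, in which the loss is evaluated under every assignment of estimated sources to reference speakers and the minimum is kept. The sketch below illustrates that baseline mechanism only; it is not the paper's GPIT modification or its ASR-encoder embedding loss. The function names and the use of mean-squared error as the per-source loss are illustrative assumptions — in the proposed method, a loss on pre-trained ASR encoder embeddings would take the place of `mse`.

```python
import itertools
import numpy as np

def pit_loss(estimates, references, loss_fn):
    """Utterance-level PIT: score every permutation of the estimated
    sources against the references and return the minimum mean loss
    together with the permutation that achieves it."""
    n = len(references)
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):
        # perm[i] is the index of the estimate assigned to reference i.
        loss = np.mean([loss_fn(estimates[p], references[i])
                        for i, p in enumerate(perm)])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Toy two-speaker example with MSE as the per-source loss
# (an embedding-based loss would be substituted here).
mse = lambda a, b: float(np.mean((a - b) ** 2))
refs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
ests = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]  # output order swapped
loss, perm = pit_loss(ests, refs, mse)
# PIT recovers the swapped assignment: perm == (1, 0) with zero loss.
```

The factorial cost over permutations is why PIT is usually applied with small speaker counts (two or three), as in the overlapping-speaker setting the abstract describes.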