Investigating Low-Distortion Speech Enhancement with Discrete Cosine Transform Features for Robust Speech Recognition

Cited by: 0
Authors
Tsao, Yu-Sheng [1 ]
Hung, Jeih-Weih [2 ]
Ho, Kuan-Hsun [1 ]
Chen, Berlin [1 ]
Affiliations
[1] Natl Taiwan Normal Univ, Taipei, Taiwan
[2] Natl Chi Nan Univ, Nantou, Taiwan
Source
PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) | 2022
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
This study investigates constructing low-distortion utterances with a front-end speech enhancement (SE) network so as to benefit downstream automatic speech recognition (ASR) systems. With the dual-path Transformer network (DPTNet) as the SE archetype, we make effective use of short-time discrete cosine transform (STDCT) features as the input representation for the mask-estimation network. Furthermore, we jointly optimize a spectral-distance loss and a perceptual loss when training the components of the proposed SE model, so that the input utterances are enhanced without introducing significant distortion. Extensive evaluation experiments are conducted on the VoiceBank-DEMAND and VoiceBank-QUT tasks, which contain stationary and non-stationary noises, respectively. The results show that the proposed SE method yields perceptual metric scores competitive with several state-of-the-art methods while achieving significantly lower word error rates (WERs) on ASR. Notably, the proposed SE method works remarkably well on the VoiceBank-QUT ASR task, confirming its strong generalization capability to unseen scenarios.
Pages: 131-136
Number of pages: 6
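
Note on the front-end idea: the abstract's key choice is to enhance speech in the STDCT domain (a real-valued transform) rather than on complex STFT spectra, with a network-estimated mask applied to the STDCT features. Below is a minimal illustrative sketch of STDCT analysis, masking, and overlap-add resynthesis in Python (NumPy/SciPy); the frame length, hop size, Hann window, and the helper names stdct and apply_mask_and_resynthesize are assumptions made for illustration, not the paper's actual configuration, and the DPTNet-based mask estimator and the joint spectral-distance/perceptual training loss are not reproduced here.

    # Illustrative STDCT analysis/masking/resynthesis sketch (assumed settings,
    # not the paper's exact configuration).
    import numpy as np
    from scipy.fft import dct, idct

    def stdct(wave, frame_len=512, hop=256):
        """Split the waveform into overlapping Hann-windowed frames and apply a
        type-II DCT to each frame, giving a real-valued time-frequency matrix."""
        window = np.hanning(frame_len)
        n_frames = 1 + (len(wave) - frame_len) // hop  # assumes len(wave) >= frame_len
        frames = np.stack([wave[i * hop:i * hop + frame_len] * window
                           for i in range(n_frames)])
        return dct(frames, type=2, norm="ortho", axis=-1)  # shape: (n_frames, frame_len)

    def apply_mask_and_resynthesize(features, mask, frame_len=512, hop=256):
        """Apply an estimated mask (e.g. predicted by an SE network) to the STDCT
        features, invert each frame with the type-II IDCT, and overlap-add back to
        a waveform (synthesis-window normalization omitted for brevity)."""
        frames = idct(features * mask, type=2, norm="ortho", axis=-1)
        out = np.zeros(hop * (len(frames) - 1) + frame_len)
        for i, frame in enumerate(frames):
            out[i * hop:i * hop + frame_len] += frame
        return out

    if __name__ == "__main__":
        noisy = np.random.randn(16000)   # stand-in for a 1-second noisy utterance at 16 kHz
        feats = stdct(noisy)
        mask = np.ones_like(feats)       # identity mask; a real mask would come from the SE network
        enhanced = apply_mask_and_resynthesize(feats, mask)

Because the DCT is real-valued, a single real mask suffices and no separate magnitude/phase handling is needed, which is one commonly cited motivation for DCT-domain enhancement.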