SPEECH INTELLIGIBILITY ENHANCEMENT USING NON-PARALLEL SPEAKING STYLE CONVERSION WITH STARGAN AND DYNAMIC RANGE COMPRESSION

Cited by: 0
Authors
Li, Gang [1 ,2 ]
Hu, Ruimin [1 ,2 ]
Ke, Shanfa [1 ]
Zhang, Rui [1 ]
Wang, Xiaochen [1 ,3 ]
Gao, Li [1 ]
Affiliations
[1] Wuhan Univ, Natl Engn Res Ctr Multimedia Software, Sch Comp Sci, Wuhan, Hubei, Peoples R China
[2] Wuhan Univ, Hubei Key Lab Multimedia & Network Commun Engn, Wuhan, Hubei, Peoples R China
[3] Wuhan Univ Shenzhen, Res Inst, Shenzhen, Peoples R China
Source
2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME) | 2020
Keywords
speech intelligibility; Lombard effect; speaking style conversion (SSC); StarGAN; dynamic range compression (DRC); LOMBARD SPEECH; VOCODER; NOISE;
DOI
10.1109/icme46284.2020.9102916
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Speech intelligibility enhancement is a perceptual enhancement technique for clean speech reproduced in noisy environments. It is typically used in the listening stage of multimedia communications. In this study, we enhance speech intelligibility by speaking style conversion (SSC), a data-driven approach inspired by a vocal mechanism known as the Lombard effect. The proposed SSC method combines star generative adversarial network (StarGAN) based mapping with dynamic range compression (DRC). It has two main advantages: 1) unlike the gender-independent conversion used in previous studies, StarGAN can learn the speech features of different genders separately, providing gender-specific conversion with a single model and non-parallel training data; 2) we design a multi-level enhancement strategy that uses DRC within the StarGAN architecture, which improves SSC performance under strong noise interference. Experiments show that our method outperforms baseline methods.
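The abstract names dynamic range compression (DRC) as one stage of the method but does not describe its configuration. As a rough illustration of the general idea only, the sketch below implements a generic frame-wise static compressor in NumPy and then rescales the output to the input RMS. It is not the authors' implementation: the function names, the threshold, ratio, and frame length, and the equal-RMS rescaling are all assumptions made for this example.

```python
import numpy as np


def compress_dynamic_range(x, threshold_db=-20.0, ratio=4.0, frame_len=256, eps=1e-10):
    """Generic frame-wise static compressor (hypothetical parameters).

    Frames whose RMS level exceeds threshold_db are attenuated so that the
    level above the threshold is divided by `ratio`; quieter frames pass
    through unchanged.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.copy(x)
    for start in range(0, len(x), frame_len):
        frame = x[start:start + frame_len]
        level_db = 20.0 * np.log10(np.sqrt(np.mean(frame ** 2) + eps))
        if level_db > threshold_db:
            # Reduce the overshoot above the threshold by a factor of `ratio`.
            gain_db = (threshold_db - level_db) * (1.0 - 1.0 / ratio)
        else:
            gain_db = 0.0
        y[start:start + frame_len] = frame * 10.0 ** (gain_db / 20.0)
    return y


def match_rms(y, reference):
    """Rescale y so its overall RMS equals that of the reference signal."""
    y = np.asarray(y, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    return y * np.sqrt(np.mean(reference ** 2) / np.mean(y ** 2))


# Toy input: a loud tone segment followed by a quiet one (not real speech).
n = np.arange(1024)
tone = np.sin(2.0 * np.pi * 8.0 * n / 256.0)
x = np.concatenate([0.5 * tone, 0.01 * tone])

# Compress, then restore the original overall energy.
y = match_rms(compress_dynamic_range(x), x)
```

With the overall energy held fixed, the quiet segment ends up louder relative to the loud segment than it was in the input, which is the effect a compressor is meant to have on low-level portions of a signal. A practical compressor would normally add attack and release smoothing of the gain, which this sketch omits.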
Pages: 6