mdctGAN: Taming transformer-based GAN for speech super-resolution with Modified DCT spectra

Cited by: 2
Authors
Shuai, Chenhao [1 ,3 ,4 ]
Shi, Chaohua [2 ,3 ,4 ]
Gan, Lu [3 ]
Liu, Hongqing [4 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Xidian Univ, Xian, Shaanxi, Peoples R China
[3] Brunel Univ London, London, England
[4] Chongqing Univ Posts & Telecommun, Chongqing, Peoples R China
Keywords
speech super-resolution; phase information; GAN;
DOI
10.21437/Interspeech.2023-113
CLC number
O42 [Acoustics]
Subject classification code
070206; 082403
Abstract
Speech super-resolution (SSR) aims to recover high-resolution (HR) speech from its low-resolution (LR) counterpart. Recent SSR methods focus mainly on reconstructing the magnitude spectrogram and neglect phase reconstruction, which limits recovery quality. To address this issue, we propose mdctGAN, a novel SSR framework based on the modified discrete cosine transform (MDCT). By adversarial learning in the MDCT domain, our method reconstructs HR speech in a phase-aware manner without vocoders or additional post-processing. Furthermore, by learning frequency-consistent features with a self-attention mechanism, mdctGAN guarantees high-quality speech reconstruction. On the VCTK corpus, experimental results show that our model produces natural auditory quality with high MOS and PESQ scores. It also achieves state-of-the-art log-spectral-distance (LSD) performance at 48 kHz target resolution from various input rates. Code is available from https://github.com/neoncloud/mdctGAN
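The MDCT underlying the framework is a real-valued, critically sampled lapped transform: a length-2N windowed frame maps to N coefficients, and time-domain aliasing cancellation (TDAC) across 50%-overlapped frames makes the analysis/synthesis pair invertible, which is why a model operating on MDCT spectra carries phase information implicitly. The following is a minimal NumPy sketch of that transform pair (not the paper's implementation); the direct matrix form, the 2/N inverse scaling, and the sine window are standard textbook choices assumed here:

```python
import numpy as np

def mdct(frame):
    """MDCT of one length-2N frame -> N real coefficients."""
    N = len(frame) // 2
    n = np.arange(2 * N)               # time index within the frame
    k = np.arange(N)[:, None]          # frequency index (column vector)
    return np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5)) @ frame

def imdct(X):
    """Inverse MDCT: N coefficients -> length-2N time-aliased frame."""
    N = len(X)
    n = np.arange(2 * N)[:, None]
    k = np.arange(N)
    # Factor 2/N makes overlap-add with Princen-Bradley windows exact.
    return (2 / N) * np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5)) @ X

def mdct_roundtrip(x, N=64):
    """Analysis + synthesis with 50% overlap; interior samples are
    reconstructed exactly via time-domain aliasing cancellation."""
    # Sine window satisfies the Princen-Bradley condition w[n]^2 + w[n+N]^2 = 1.
    w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))
    y = np.zeros(len(x))
    for start in range(0, len(x) - 2 * N + 1, N):
        frame = x[start:start + 2 * N]
        y[start:start + 2 * N] += w * imdct(mdct(w * frame))
    return y
```

Because the transform is real and invertible, a generator that predicts MDCT coefficients directly fixes both magnitude and phase of the output waveform, so no vocoder is needed at synthesis time.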
Pages: 5112-5116
Page count: 5