REAL-TIME BINAURAL SPEECH SEPARATION WITH PRESERVED SPATIAL CUES

Cited by: 0

Authors:

Han, Cong [1]

Luo, Yi [1]

Mesgarani, Nima [1]

Affiliation:

[1] Columbia Univ, Dept Elect Engn, New York, NY 10027 USA

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020

Funding:

National Science Foundation (USA);

Keywords:

Binaural speech separation; interaural cues; deep learning; real-time;

DOI:

10.1109/icassp40776.2020.9053215

Chinese Library Classification code:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Deep learning speech separation algorithms have achieved great success in improving the quality and intelligibility of separated speech from mixed audio. Most previous methods focused on generating a single-channel output for each of the target speakers, hence discarding the spatial cues needed for the localization of sound sources in space. However, preserving the spatial information is important in many applications that aim to accurately render the acoustic scene such as in hearing aids and augmented reality (AR). Here, we propose a speech separation algorithm that preserves the interaural cues of separated sound sources and can be implemented with low latency and high fidelity, therefore enabling a real-time modification of the acoustic scene. Based on the time-domain audio separation network (TasNet), a single-channel time-domain speech separation system that can be implemented in real-time, we propose a multi-input-multi-output (MIMO) end-to-end extension of TasNet that takes binaural mixed audio as input and simultaneously separates target speakers in both channels. Experimental results show that the proposed end-to-end MIMO system is able to significantly improve the separation performance and keep the perceived location of the modified sources intact in various acoustic scenes.

Pages: 6404-6408

Page count: 5

