End-to-End Paired Ambisonic-Binaural Audio Rendering

Cited by: 1

Authors:

Zhu, Yin [1]

Kong, Qiuqiang [2, 3]

Shi, Junjie [2, 3]

Liu, Shilei [2, 3]

Ye, Xuzhou [2, 3]

Wang, Ju-Chiang [2, 3]

Shan, Hongming [4, 5, 6]

Zhang, Junping [1]

Affiliations:

[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Intelligent Informat Proc, Shanghai 200433, Peoples R China

[2] Beijing ByteDance Technol Co Ltd, Shanghai 201102, Peoples R China

[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China

[4] Fudan Univ, Inst Sci & Technol Brain Inspired Intelligence, Shanghai 200433, Peoples R China

[5] Fudan Univ, MOE Frontiers Ctr Brain Sci, Shanghai 200433, Peoples R China

[6] Shanghai Ctr Brain Sci & Brain Inspired Technol, Shanghai 200031, Peoples R China

Source:

IEEE-CAA JOURNAL OF AUTOMATICA SINICA | 2024 / Vol. 11 / No. 02

Funding:

National Natural Science Foundation of China;

Keywords:

Measurement; Costs; Neural networks; Virtual reality; Rendering (computer graphics); Task analysis; Optimization; Ambisonic; attention; binaural rendering; neural network;

DOI:

10.1109/JAS.2023.123969

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Binaural rendering is of great interest to virtual reality and immersive media. Although humans can naturally use their two ears to perceive the spatial information contained in sounds, it is a challenging task for machines to achieve binaural rendering, since the description of a sound field often requires multiple channels and even the metadata of the sound sources. In addition, the perceived sound varies from person to person even in the same sound field. Previous methods generally rely on individual-dependent head-related transfer function (HRTF) datasets and optimization algorithms that act on HRTFs. In practical applications, existing methods have two major drawbacks. The first is a high personalization cost, as traditional methods meet personalization needs by measuring HRTFs. The second is insufficient accuracy, because the optimization goal of traditional methods is to retain the perceptually more important part of the information at the cost of discarding the rest. Therefore, it is desirable to develop novel techniques that achieve personalization and accuracy at a low cost. To this end, we focus on the binaural rendering of ambisonic signals and propose 1) a channel-shared encoder and channel-compared attention integrated into neural networks and 2) a loss function quantifying interaural level differences to deal with spatial information. To verify the proposed method, we collect and release the first paired ambisonic-binaural dataset and introduce three metrics to evaluate the content information and spatial information accuracy of the end-to-end methods. Extensive experimental results on the collected dataset demonstrate the superior performance of the proposed method and the shortcomings of previous methods.
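
Note: the abstract mentions a loss function that quantifies interaural level differences (ILD) but does not give its form. As a rough illustration only, the sketch below uses the common textbook definition of ILD as the left-to-right energy ratio in decibels, computed over the full band of a short signal. The function names (channel_energy, ild_db, ild_loss), the EPS constant, and the absolute-error form of the loss are assumptions made for this example and are not taken from the paper.

```python
import math

EPS = 1e-12  # assumed small constant; guards against log(0) on silent channels


def channel_energy(samples):
    """Sum of squared samples for one channel."""
    return sum(s * s for s in samples)


def ild_db(left, right):
    """Interaural level difference in dB; positive when the left ear is louder."""
    ratio = (channel_energy(left) + EPS) / (channel_energy(right) + EPS)
    return 10.0 * math.log10(ratio)


def ild_loss(pred_left, pred_right, ref_left, ref_right):
    """Absolute ILD error (dB) between a rendered and a reference binaural pair."""
    return abs(ild_db(pred_left, pred_right) - ild_db(ref_left, ref_right))


# Toy example: 10 ms of a 440 Hz tone sampled at 48 kHz.
tone = [math.sin(2 * math.pi * 440 * i / 48000) for i in range(480)]
ref_left, ref_right = tone, [0.5 * s for s in tone]     # right ear at half amplitude
pred_left, pred_right = tone, [0.25 * s for s in tone]  # rendering makes right too quiet

print(round(ild_db(ref_left, ref_right), 2))
print(round(ild_loss(pred_left, pred_right, ref_left, ref_right), 2))
```

In this toy case the reference pair has an amplitude ratio of 2 (about 6 dB) and the rendered pair a ratio of 4 (about 12 dB), so the loss is about 6 dB. A practical system would more likely compute such a quantity per frequency band and time frame, which this sketch does not attempt.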


Pages: 502-513

Number of pages: 12
