MULTIMODAL SPEAKER DIARIZATION OF REAL-WORLD MEETINGS USING D-VECTORS WITH SPATIAL FEATURES

Citations: 0

Authors:

Kang, Wonjune [1]

Roy, Brandon C. [2, 3]

Chow, Wesley [2, 3]

Affiliations:

[1] MIT, Cambridge, MA 02139 USA

[2] MIT, Media Lab, Cambridge, MA 02139 USA

[3] Cortico, Boston, MA USA

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020

Keywords:

Speaker diarization; d-vector; beamforming; sound source localization; spectral clustering

DOI:

10.1109/icassp40776.2020.9053122

Chinese Library Classification number:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

Deep neural network based audio embeddings (d-vectors) have demonstrated superior performance in audio-only speaker diarization compared to traditional acoustic features such as mel-frequency cepstral coefficients (MFCCs) and i-vectors. However, there has been little work on multimodal diarization systems that combine d-vectors with additional sources of information. In this paper, we present a novel approach to multimodal speaker diarization that combines d-vectors with spatial information derived from performing beamforming given a multi-channel microphone array. Our system performs spectral clustering on a combination of speaker embeddings and spatial features that are computed using the Steered-Response Power Phase Transform (SRP-PHAT) algorithm. We evaluate our system on the AMI Meeting Corpus and an internal dataset of real-world conversations. By using both acoustic and spatial features for diarization, we achieve significant improvements over a d-vector only baseline and show potential to achieve comparable results with other state-of-the-art multimodal diarization systems.
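The core idea in the abstract, spectral clustering over a combination of speaker embeddings and spatial features, can be illustrated with a minimal sketch. The abstract does not say how the two feature types are combined, so the weighted sum of cosine affinities used here, the `alpha` weight, the function names, and the toy feature values are all assumptions made for illustration. The sketch is also limited to two speakers and splits segments by the sign of the second eigenvector of the normalized graph Laplacian; it is not the authors' implementation.

```python
import numpy as np


def cosine_affinity(features):
    """Pairwise cosine similarity between rows, rescaled from [-1, 1] to [0, 1]."""
    x = np.asarray(features, dtype=float)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return (unit @ unit.T + 1.0) / 2.0


def fused_affinity(dvectors, spatial, alpha=0.5):
    """Weighted sum of acoustic and spatial affinities (assumed fusion rule)."""
    return alpha * cosine_affinity(dvectors) + (1.0 - alpha) * cosine_affinity(spatial)


def two_way_spectral_labels(affinity):
    """Split segments into two clusters using the normalized graph Laplacian."""
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    laplacian = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; column 1 is the
    # eigenvector of the second-smallest eigenvalue.
    _, eigenvectors = np.linalg.eigh(laplacian)
    return (eigenvectors[:, 1] > 0).astype(int)


# Toy data (made up): six speech segments, the first three from one
# speaker and the last three from another.
dvectors = np.array([
    [1.00, 0.10, 0.00],
    [0.90, 0.00, 0.10],
    [1.00, 0.05, 0.05],
    [0.00, 1.00, 0.10],
    [0.10, 0.90, 0.00],
    [0.05, 1.00, 0.05],
])
# Hypothetical spatial features, e.g. steered-response power over four
# candidate directions for each segment.
spatial = np.array([
    [0.90, 0.10, 0.00, 0.00],
    [0.80, 0.20, 0.00, 0.00],
    [0.85, 0.10, 0.05, 0.00],
    [0.00, 0.00, 0.20, 0.80],
    [0.00, 0.05, 0.10, 0.90],
    [0.00, 0.00, 0.10, 0.90],
])

labels = two_way_spectral_labels(fused_affinity(dvectors, spatial, alpha=0.5))
print(labels)
```

Which cluster receives label 0 or 1 depends on the arbitrary sign of the eigenvector, so only the grouping of segments is meaningful. Handling more than two speakers would typically require clustering several eigenvectors (for example with k-means) rather than thresholding one.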


Pages: 6509-6513

Number of pages: 5
