MULTIMODAL SPEAKER DIARIZATION OF REAL-WORLD MEETINGS USING D-VECTORS WITH SPATIAL FEATURES

Cited by: 0
Authors
Kang, Wonjune [1]
Roy, Brandon C. [2, 3]
Chow, Wesley [2, 3]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
[2] MIT, Media Lab, Cambridge, MA 02139 USA
[3] Cortico, Boston, MA USA
Keywords
Speaker diarization; d-vector; beamforming; sound source localization; spectral clustering
DOI
10.1109/icassp40776.2020.9053122
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Deep neural network based audio embeddings (d-vectors) have demonstrated superior performance in audio-only speaker diarization compared to traditional acoustic features such as mel-frequency cepstral coefficients (MFCCs) and i-vectors. However, there has been little work on multimodal diarization systems that combine d-vectors with additional sources of information. In this paper, we present a novel approach to multimodal speaker diarization that combines d-vectors with spatial information derived from performing beamforming given a multi-channel microphone array. Our system performs spectral clustering on a combination of speaker embeddings and spatial features that are computed using the Steered-Response Power Phase Transform (SRP-PHAT) algorithm. We evaluate our system on the AMI Meeting Corpus and an internal dataset of real-world conversations. By using both acoustic and spatial features for diarization, we achieve significant improvements over a d-vector only baseline and show potential to achieve comparable results with other state-of-the-art multimodal diarization systems.
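The abstract describes spectral clustering over a combination of speaker embeddings and spatial features. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes one d-vector and one spatial feature vector per speech segment are already available (for example, the spatial vector could be derived from SRP-PHAT output, which is not computed here). The function names, the cosine-based affinity, and the mixing weight `alpha` are all hypothetical choices made for illustration; the abstract does not specify how the two feature types are fused.

```python
import numpy as np


def cosine_affinity(features):
    """Pairwise cosine similarity rescaled from [-1, 1] to [0, 1]."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return 0.5 * (1.0 + unit @ unit.T)


def fused_affinity(embeddings, spatial, alpha=0.5):
    """Weighted sum of the acoustic and spatial affinities.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    return (alpha * cosine_affinity(embeddings)
            + (1.0 - alpha) * cosine_affinity(spatial))


def spectral_cluster(affinity, n_speakers, n_iter=50):
    """Cluster segments from an affinity matrix; returns one label per segment."""
    # Symmetric normalisation: D^{-1/2} A D^{-1/2}.
    inv_sqrt_deg = 1.0 / np.sqrt(affinity.sum(axis=1))
    normalised = affinity * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]
    # eigh returns eigenvalues in ascending order, so the last columns
    # hold the eigenvectors with the largest eigenvalues.
    _, eigvecs = np.linalg.eigh(normalised)
    rows = eigvecs[:, -n_speakers:]
    rows = rows / np.linalg.norm(rows, axis=1, keepdims=True)

    # Deterministic farthest-point initialisation, then plain k-means.
    centres = [rows[0]]
    for _ in range(1, n_speakers):
        dist = np.min([np.sum((rows - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(rows[np.argmax(dist)])
    centres = np.array(centres)
    for _ in range(n_iter):
        dists = ((rows[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_speakers):
            if np.any(labels == k):
                centres[k] = rows[labels == k].mean(axis=0)
    return labels
```

Calling `spectral_cluster(fused_affinity(embeddings, spatial), n_speakers=2)` returns an integer label per segment. Setting `alpha=1.0` ignores the spatial features entirely, which corresponds to an embedding-only configuration in this sketch. The number of speakers is passed in explicitly here; estimating it from the data is outside the scope of the example.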
Pages: 6509-6513
Number of pages: 5