Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis

Cited by: 0
Authors
P. Cabañas-Molero
M. Lucena
J. M. Fuertes
P. Vera-Candeas
N. Ruiz-Reyes
Affiliations
[1] University of Jaén, Department of Telecommunication Engineering
[2] University of Jaén, Department of Computer Science
Source
Multimedia Tools and Applications | 2018 / Vol. 77
Keywords
Speaker diarization; Meeting rooms; SRP-PHAT; Multimodal processing
DOI
Not available
Abstract
Speaker diarization is traditionally defined as the problem of determining “who speaks when” given an audio or video stream. This is an important task in many meeting-room applications, including automatic transcription of conversations, camera steering, and content summarization. When the room is equipped with microphone arrays and cameras, speakers can be distinguished by their location, and the problem can be addressed through localization techniques. This article proposes a multimodal speaker diarization system for meeting environments based on a modified SRP-PHAT function evaluated on space volumes rather than discrete points. In our system, this function is used in combination with a circular array, enabling audio-based localization through the selection of local maxima. Voicing detection is used to detect speech frames, whereas video analysis is introduced to aid the decision when users move or speak simultaneously. The approach is evaluated on the well-known AMI dataset, with approximately 100 hours of realistic meeting recordings, and shows an average diarization error rate of 21% to 25%.
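For readers unfamiliar with the localization step mentioned in the abstract, the following is a minimal sketch of conventional, point-evaluated SRP-PHAT over a grid of candidate azimuths. It is not the volume-evaluated variant the article proposes, and it omits voicing detection and video analysis. The array geometry (8 microphones, 0.1 m radius), the far-field assumption, the sampling rate, the speed of sound, and all function names (`circular_array`, `arrival_delays`, `srp_phat`, `simulate_source`) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room value


def circular_array(num_mics=8, radius=0.1):
    """Hypothetical 2-D circular array geometry in metres (not from the paper)."""
    angles = 2 * np.pi * np.arange(num_mics) / num_mics
    return radius * np.column_stack((np.cos(angles), np.sin(angles)))


def arrival_delays(mics, azimuths_deg):
    """Far-field arrival delay (s) at each mic relative to the array centre."""
    az = np.deg2rad(np.atleast_1d(azimuths_deg))
    directions = np.column_stack((np.cos(az), np.sin(az)))  # (A, 2)
    return -(mics @ directions.T) / SPEED_OF_SOUND            # (M, A)


def srp_phat(frames, mics, fs, azimuths_deg):
    """Conventional SRP-PHAT power for each candidate azimuth.

    frames: (M, N) array holding one frame of samples per microphone.
    """
    num_mics, num_samples = frames.shape
    spectra = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    delays = arrival_delays(mics, azimuths_deg)
    power = np.zeros(delays.shape[1])
    for i in range(num_mics):
        for j in range(i + 1, num_mics):
            cross = spectra[i] * np.conj(spectra[j])
            phat = cross / (np.abs(cross) + 1e-12)  # PHAT: keep phase only
            tdoa = delays[i] - delays[j]            # expected TDOA per candidate
            steering = np.exp(2j * np.pi * np.outer(freqs, tdoa))  # (F, A)
            power += np.real(phat @ steering)
    return power


def simulate_source(mics, fs, azimuth_deg, num_samples=1024, seed=0):
    """Toy signal: white noise with exact (circular) far-field delays."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(rng.standard_normal(num_samples))
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    delays = arrival_delays(mics, azimuth_deg)[:, 0]
    shifted = spectrum * np.exp(-2j * np.pi * np.outer(delays, freqs))
    return np.fft.irfft(shifted, n=num_samples, axis=1)


if __name__ == "__main__":
    mics = circular_array()
    frames = simulate_source(mics, fs=16000, azimuth_deg=60.0)
    grid = np.arange(0, 360, 5)  # candidate azimuths in degrees
    power = srp_phat(frames, mics, 16000, grid)
    print("estimated azimuth (deg):", grid[np.argmax(power)])
```

The sketch takes only the global maximum of the power map for a single simulated source. A system handling several speakers would instead inspect multiple local maxima, as the abstract describes.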
Pages: 27685-27707
Number of pages: 22
Related papers
4 items in total
  • [1] Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis
    Cabanas-Molero, P.
    Lucena, M.
    Fuertes, J. M.
    Vera-Candeas, P.
    Ruiz-Reyes, N.
    MULTIMEDIA TOOLS AND APPLICATIONS, 2018, 77 (20) : 27685 - 27707
  • [2] Multi-Speaker Direction of Arrival Estimation using SRP-PHAT Algorithm with a Weighted Histogram
    Hadad, Elior
    Gannot, Sharon
    2018 IEEE INTERNATIONAL CONFERENCE ON THE SCIENCE OF ELECTRICAL ENGINEERING IN ISRAEL (ICSEE), 2018,
  • [3] MULTIMODAL SPEAKER DIARIZATION OF REAL-WORLD MEETINGS USING D-VECTORS WITH SPATIAL FEATURES
    Kang, Wonjune
    Roy, Brandon C.
    Chow, Wesley
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6509 - 6513
  • [4] MULTI-MODAL SPEAKER DIARIZATION OF REAL-WORLD MEETINGS USING COMPRESSED-DOMAIN VIDEO FEATURES
    Friedland, Gerald
    Hung, Hayley
    Yeo, Chuohao
    2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1- 8, PROCEEDINGS, 2009, : 4069 - +