Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis

Cited by: 0

Authors:

P. Cabañas-Molero

M. Lucena

J. M. Fuertes

P. Vera-Candeas

N. Ruiz-Reyes

Affiliations:

[1] University of Jaén,Department of Telecommunication Engineering

[2] University of Jaén,Department of Computer Science

Source:

Multimedia Tools and Applications | 2018 / Vol. 77

Keywords:

Speaker diarization; Meeting rooms; SRP-PHAT; Multimodal processing

DOI:

Not available

Abstract:

Speaker diarization is traditionally defined as the problem of determining “who speaks when” given an audio or video stream. This is an important task in many applications for meeting rooms, including automatic transcription of conversations, camera steering or content summarization. When the room is equipped with microphone arrays and cameras, speakers can be distinguished according to their location and the problem can be addressed through localization techniques. This article proposes a multimodal speaker diarization system for meeting environments based on a modified SRP-PHAT function evaluated on space volumes rather than discrete points. In our system, this function is used in combination with a circular array, enabling audio-based localization based on the selection of local maxima. Voicing detection is used to detect speech frames, whereas video analysis is introduced to aid in the decision when users move or simultaneously speak. The approach is evaluated on the well-known AMI dataset with approximately 100 hours of realistic meeting recordings and shows an average diarization error rate of 21% – 25%.
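
To illustrate the audio localization step the abstract describes, here is a minimal sketch of conventional, point-based SRP-PHAT in Python with NumPy. It is not the volume-evaluated variant proposed in the article, whose formulation is not given in this record. The function name `srp_phat`, the array geometry (8 microphones, radius 0.1 m), the candidate grid, and the speed of sound are all assumptions chosen for the example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def srp_phat(frames, mic_pos, candidates, fs):
    """Conventional point-based SRP-PHAT (illustrative sketch).

    frames:     (n_mics, n_samples) one time frame per microphone
    mic_pos:    (n_mics, 3) microphone coordinates in metres
    candidates: (n_points, 3) candidate source positions in metres
    Returns one steered-response value per candidate point.
    """
    n_mics, n_samples = frames.shape
    spectra = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)  # bin frequencies in Hz

    # Propagation time (seconds) from every candidate to every microphone.
    dists = np.linalg.norm(candidates[:, None, :] - mic_pos[None, :, :], axis=2)
    toa = dists / SPEED_OF_SOUND

    power = np.zeros(len(candidates))
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cross = spectra[i] * np.conj(spectra[j])
            cross /= np.abs(cross) + 1e-12      # PHAT weighting: keep phase only
            tdoa = toa[:, i] - toa[:, j]        # expected delay for each candidate
            # Evaluate the PHAT-weighted correlation at each expected delay.
            steer = np.exp(2j * np.pi * np.outer(tdoa, freqs))
            power += np.real(steer @ cross)
    return power


# Synthetic check with a hypothetical geometry: an 8-microphone circular
# array and a white-noise source placed at one of 36 candidate positions.
fs = 16000
n_samples = 1024
rng = np.random.default_rng(0)
mic_angles = np.arange(8) * 2 * np.pi / 8
mic_pos = 0.1 * np.stack(
    [np.cos(mic_angles), np.sin(mic_angles), np.zeros(8)], axis=1)
cand_angles = np.deg2rad(np.arange(0, 360, 10))
candidates = 1.5 * np.stack(
    [np.cos(cand_angles), np.sin(cand_angles), np.zeros(36)], axis=1)

true_idx = 4  # the source sits at the candidate located at 40 degrees
source = rng.standard_normal(n_samples)
delays = np.linalg.norm(candidates[true_idx] - mic_pos, axis=1) / SPEED_OF_SOUND
freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
# Apply each fractional delay as a linear phase shift (circular delay).
shifted = np.fft.rfft(source) * np.exp(-2j * np.pi * np.outer(delays, freqs))
frames = np.fft.irfft(shifted, n=n_samples, axis=1)

response = srp_phat(frames, mic_pos, candidates, fs)
estimate = int(np.argmax(response))
print("estimated candidate index:", estimate, "true index:", true_idx)
```

The sketch steers in the frequency domain so that fractional delays need no rounding. A system like the one in the abstract would additionally select several local maxima per frame, gate frames with voicing detection, and combine the result with video cues; none of that is shown here.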

Pages: 27685-27707

Page count: 22

Related records (4 in total):

[1] Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis
Cabanas-Molero, P.
Lucena, M.
Fuertes, J. M.
Vera-Candeas, P.
Ruiz-Reyes, N.
MULTIMEDIA TOOLS AND APPLICATIONS, 2018, 77 (20) : 27685 - 27707
[2] Multi-Speaker Direction of Arrival Estimation using SRP-PHAT Algorithm with a Weighted Histogram
Hadad, Elior
Gannot, Sharon
2018 IEEE INTERNATIONAL CONFERENCE ON THE SCIENCE OF ELECTRICAL ENGINEERING IN ISRAEL (ICSEE), 2018
[3] MULTIMODAL SPEAKER DIARIZATION OF REAL-WORLD MEETINGS USING D-VECTORS WITH SPATIAL FEATURES
Kang, Wonjune
Roy, Brandon C.
Chow, Wesley
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6509 - 6513
[4] MULTI-MODAL SPEAKER DIARIZATION OF REAL-WORLD MEETINGS USING COMPRESSED-DOMAIN VIDEO FEATURES
Friedland, Gerald
Hung, Hayley
Yeo, Chuohao
2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1- 8, PROCEEDINGS, 2009, : 4069 - +