Robust Speaker Diarization for Meetings: the ICSI system

Published September 7, 2016, 16:26
The goal of speaker diarization is to determine when each participant speaks in a recording. This information is used extensively in ASR systems (for example, for VTLN) and in speaker indexing systems, and it is part of the ongoing Rich Transcription (RT) evaluations organized by NIST. In these evaluations, all systems have to analyze the data without any prior knowledge of how many people are in the recordings or who they are. Recently there has been increasing interest in speech and video analysis of the meeting environment (NIST's RT05s and RT06s evaluations; the AMI, CHIL and IM2 projects). In such meetings there are normally several microphones recording synchronously, either organized in microphone clusters or spread across the room at unknown locations. This must be taken into consideration when developing a diarization system for this domain. In this presentation I will cover the basics of speaker diarization and the main techniques used over the years. Then I will focus on the system implemented at the International Computer Science Institute (ICSI) for speaker diarization in the meeting environment. This system is based on a single-channel diarization system originally created for broadcast news, with a preprocessing step based on the delay-and-sum algorithm that makes use of the multiple channels available. This system is currently the state of the art for meeting diarization, and it will shortly be available for download through my ICSI webpage.
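To make the delay-and-sum preprocessing step more concrete, here is a minimal sketch of the general technique: each channel is time-aligned to a reference channel and the aligned channels are averaged into one enhanced signal. The abstract does not describe the ICSI implementation, so everything here is an assumption for illustration: the function names `estimate_delay` and `delay_and_sum` are hypothetical, delays are estimated with plain time-domain cross-correlation over the whole signal, and channels are weighted equally.

```python
def estimate_delay(ref, sig, max_lag):
    """Return the lag d (in samples) maximizing sum(ref[i] * sig[i + d]).

    Assumption: plain cross-correlation over the whole signal; a real
    system may use a more robust estimator on short windows.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(ref)):
            j = i + lag
            if 0 <= j < len(sig):
                score += ref[i] * sig[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_and_sum(channels, ref_index=0, max_lag=16):
    """Align every channel to the reference channel and average them.

    Returns (enhanced_signal, delays). Equal channel weights are an
    assumption made for this sketch.
    """
    ref = channels[ref_index]
    delays = [estimate_delay(ref, ch, max_lag) for ch in channels]
    out = [0.0] * len(ref)
    for ch, d in zip(channels, delays):
        for i in range(len(ref)):
            j = i + d
            if 0 <= j < len(ch):
                out[i] += ch[j]  # sample of ch that lines up with ref[i]
    return [v / len(channels) for v in out], delays


def shifted(signal, shift):
    """Hypothetical helper: copy of signal delayed by `shift` samples (zero-padded)."""
    out = [0.0] * len(signal)
    for i, v in enumerate(signal):
        if 0 <= i + shift < len(signal):
            out[i + shift] = v
    return out


# Toy example: one short pulse seen by three microphones at different times.
reference = [0.0] * 32
reference[10], reference[11] = 1.0, 0.5
mics = [reference, shifted(reference, 3), shifted(reference, -2)]
enhanced, delays = delay_and_sum(mics)
```

The single enhanced signal can then be passed to a single-channel diarization system, which is the role the abstract gives to this preprocessing step.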