Speaker Diarization: Optimal Clustering and Learning Speaker Embeddings

Published June 22, 2016, 19:13
Speaker diarization consists of automatically partitioning an input audio stream into homogeneous segments (segmentation) and grouping the segments that belong to the same speaker (speaker clustering). This process can improve the readability of an audio document by giving it structure, or provide the speaker's true identity when used in conjunction with a speaker recognition system. In this seminar I will talk about two new methods: ILP Clustering and speaker embeddings. In speaker clustering, a major problem with greedy hierarchical agglomerative clustering (HAC) is that it does not guarantee an optimal solution. I propose a new clustering model, called ILP Clustering, that recasts the clustering problem as a linear program, i.e., an objective function subject to linear equality and/or inequality constraints. An Integer Linear Programming (ILP) solver can then be used to search for the optimal solution over the whole problem. In the second part, I propose to learn a set of high-level feature representations through deep learning, referred to as speaker embeddings. Speaker embedding features are taken from the hidden-layer neuron activations of Deep Neural Networks (DNNs) trained as classifiers to recognize a thousand speaker identities in a training set. Although learned through identification, the speaker embeddings are shown to be effective for speaker verification, in particular for recognizing speakers unseen in the training set. The experiments were conducted on ETAPE, a corpus of French broadcast news, where the new methods based on ILP and speaker embeddings decrease the diarization error rate (DER) by 4.79 points over the baseline diarization system based on HAC/GMM.
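To make the idea of a global clustering objective concrete, here is a minimal Python sketch. It is not the implementation presented in the seminar. It assumes one possible objective: minimize the number of cluster centers plus the normalized sum of segment-to-center distances, with each segment assigned to exactly one center no farther than a threshold. The function name `ilp_style_clustering`, the threshold `delta`, and the toy distance matrix are all hypothetical. For readability, an exhaustive search over candidate center sets stands in for an ILP solver; on a tiny input it reaches the same global optimum that a solver would return for this objective, which is exactly what a greedy HAC pass cannot guarantee.

```python
from itertools import combinations


def ilp_style_clustering(dist, delta):
    """Globally minimise (#centers) + sum(segment-to-center distances) / (n * delta).

    dist  : n x n matrix (list of lists), dist[k][j] = distance between segments k and j
    delta : assumed maximum distance allowed between a segment and its center
    Returns (cost, assignment), where assignment[j] is the center index of segment j.
    Brute force is used here in place of an ILP solver; it only scales to small n.
    """
    n = len(dist)
    best_cost, best_assignment = float("inf"), None
    for size in range(1, n + 1):
        for centers in combinations(range(n), size):
            assignment, dispersion, feasible = [], 0.0, True
            for j in range(n):
                # A center represents itself; any other segment joins its nearest center.
                k = j if j in centers else min(centers, key=lambda c: dist[c][j])
                if dist[k][j] > delta:
                    feasible = False  # threshold constraint violated
                    break
                assignment.append(k)
                dispersion += dist[k][j]
            if not feasible:
                continue
            cost = size + dispersion / (n * delta)
            if cost < best_cost:  # strict comparison keeps the first optimum found
                best_cost, best_assignment = cost, assignment
    return best_cost, best_assignment


# Toy example: segments 0 and 1 are close, segments 2 and 3 are close.
toy_dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
cost, assignment = ilp_style_clustering(toy_dist, delta=0.5)
print(cost, assignment)
```

On the toy matrix the search opens two centers and groups segments 0 and 1 together and segments 2 and 3 together. In practice the same objective and constraints would be written with binary assignment variables and handed to an ILP solver, since enumeration grows exponentially with the number of segments.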