Recognizing a Million Voices: Low Dimensional Audio Representations for Speaker Identification

6 125
30
Опубликовано 17 августа 2016, 1:11
Recent advances in speaker verification technology have resulted in dramatic performance improvements in both speed and accuracy. Over the past few years, error rates have decreased by a factor of 5 or more. At the same time, the new techniques have resulted in massive speed-ups, which have increased the scale of viable speaker-id systems by several orders of magnitude. These improvements stem from a recent shift in the speaker modeling paradigm. Only a few years ago, the model for each individual speaker was trained using data from only that particular speaker. Now, we make use of large speaker-labeled databases to learn distributions describing inter- and intra-speaker variability. This allow us to reveal the speech characteristics that are important for discriminating between speakers. During the 2008 JHU summer workshop, our team has found that speech utterances can be encoded into low dimensional fixed-length vectors that preserve information about speaker identity. This concept of so-called 'i-vectors', which now forms the basis of state-of-the-art systems, enabled new machine learning approaches to be applied to the speaker identification problem. Inter- and intra-speaker variability can now be easily modeled using Bayesian approaches, which leads to superior performance. A new training strategies can now benefit form the simpler statistical model form and the inherent speed-up. In our most recent work, we have retrained the hyperparameters of our Bayesian model using a discriminative objective function that directly addresses the task in speaker verification: discrimination between same-speaker and different-speaker trials. This is the first time such discriminative training has been successfully applied to speaker verification task.
автотехномузыкадетское