Distant Speech Recognition: No Black Boxes Allowed

Published September 6, 2016, 17:42
A complete system for distant speech recognition (DSR) typically consists of several distinct components. Among these are:
o An array of microphones for far-field sound capture;
o An algorithm for tracking the positions of the active speaker or speakers;
o A beamforming algorithm for focusing on the desired speaker and suppressing noise, reverberation, and competing speech from other speakers;
o A recognition engine to extract the most likely hypothesis from the output of the beamformer;
o A speaker adaptation component for adapting to the characteristics of a given speaker as well as to channel effects;
o Postfiltering to further enhance the beamformed output.
Moreover, several of these components are composed of one or more subcomponents. While it is tempting to isolate and optimize each component individually, experience has proven that such an approach cannot lead to optimal performance. In this talk, we will discuss several examples of the interactions between the individual components of a DSR system. In addition, we will describe the synergies that become possible as soon as each component is no longer treated as a "black box". To wit, instead of treating each component as having solely an input and an output, it is necessary to peel back the lid and look inside. It is only then that it becomes apparent how the individual components of a DSR system can be viewed not as separate entities, but as the various organs of a complete body, and how optimal performance of such a system can be obtained.
Joint work with: Kenichi Kumatani, Barbara Rauch, Friedrich Faubel, Matthias Wolfel, and Dietrich Klakow