Should Machines Emulate Human Speech Recognition?

Published September 6, 2016, 17:02
Machine-based, automatic speech recognition (ASR) systems decode the acoustic signal by associating each time frame with a set of phonetic-segment possibilities; from such matrices of segment probabilities, word hypotheses are formed. This segment-based, serial time-frame approach has been standard practice in ASR for many years. Although ASR's reliability has improved dramatically in recent years, such advances have often relied on huge amounts of training material and an expert team of developers. Might there be a simpler, faster way to develop ASR applications, one that adapts quickly to novel linguistic situations and challenging acoustic environments? It is the thesis of this presentation that future-generation ASR should be based (in part) on strategies used by human listeners to decode the speech signal. A comprehensive theoretical framework will be described, one based on a variety of perceptual, statistical and machine-learning studies. This Multi-Tier framework focuses on the interaction across different levels of linguistic organization: words are composed of more than segments, and utterances consist of (far) more than words. In Multi-Tier Theory, the syllable serves as the interface between sound (as well as vision) and meaning. Units smaller than the syllable (such as the segment and articulatory-acoustic features) combine with larger units (e.g., the lexeme and the prosodic phrase) to provide a more balanced perspective than the conventional word/segment framework used in ASR affords. The presentation will consider (in some detail) how the brain decodes consonants, and how such knowledge can be used to deduce the perceptual flow of phonetic processing. It will conclude with a discussion of how human speech-decoding strategies can (realistically) be used to improve the performance of automatic speech recognition in machines.
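
For concreteness, here is a minimal sketch of the conventional segment-based pipeline the abstract contrasts with its Multi-Tier proposal: each time frame carries a distribution over phonetic segments, and word hypotheses are scored against that frame-by-phone posterior matrix by a simple best-path alignment. The phone inventory, lexicon, and randomly generated posteriors below are purely illustrative assumptions, not material from the presentation.

```python
# Sketch of conventional frame-by-frame, segment-based ASR decoding.
# Toy phone set, made-up lexicon, and random posteriors (assumptions).
import numpy as np

PHONES = ["k", "ae", "t", "b", "sil"]            # illustrative phone inventory
LEXICON = {"cat": ["k", "ae", "t"],              # hypothetical pronunciation lexicon
           "bat": ["b", "ae", "t"],
           "tab": ["t", "ae", "b"]}

def word_log_score(log_post: np.ndarray, phone_seq: list) -> float:
    """Best monotonic alignment of a phone sequence to frame-wise
    log-posteriors, via dynamic programming (phones occupy one or
    more frames, in order)."""
    idx = [PHONES.index(p) for p in phone_seq]
    n_frames, n_phones = log_post.shape[0], len(idx)
    dp = np.full((n_frames, n_phones), -np.inf)  # dp[t, s]: best score at frame t, phone s
    dp[0, 0] = log_post[0, idx[0]]
    for t in range(1, n_frames):
        for s in range(n_phones):
            stay = dp[t - 1, s]
            advance = dp[t - 1, s - 1] if s > 0 else -np.inf
            dp[t, s] = max(stay, advance) + log_post[t, idx[s]]
    return dp[-1, -1]

# Fake frame-wise phone posteriors (10 frames x 5 phones), normalized per frame.
rng = np.random.default_rng(0)
post = rng.random((10, len(PHONES)))
post /= post.sum(axis=1, keepdims=True)
log_post = np.log(post)

# Rank word hypotheses by their best-path score over the posterior matrix.
hypotheses = sorted(((w, word_log_score(log_post, seq)) for w, seq in LEXICON.items()),
                    key=lambda x: x[1], reverse=True)
for word, score in hypotheses:
    print(f"{word}: {score:.2f}")
```

In this serial time-frame scheme, everything above the segment (syllables, lexemes, prosodic phrases) enters only implicitly through the lexicon and language model; the Multi-Tier framework argued for in the talk instead treats the syllable as an explicit interface between sound and meaning.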