Microsoft Research
Published June 27, 2016, 18:13
While significant progress has been made in automatic speech recognition (ASR) over the last few decades, recognizing and understanding unconstrained conversational speech remains a challenging problem. Unlike read or highly constrained speech, spontaneous conversational speech is often ungrammatical and ill-structured. Because the relevant semantic notions are embedded in a set of keywords, the first goal is to propose a model training methodology for keyword spotting. A non-uniform minimum classification error (MCE) approach is proposed that achieves consistent and significant performance gains on large-scale English and Mandarin spontaneous conversational speech corpora (Switchboard, HKUST).

Adverse acoustic environments degrade system performance substantially. Recently, acoustic models based on deep neural networks (DNNs) have shown great success, opening new possibilities for further improving noise robustness in conversational speech recognition. The second goal is to propose a DNN-based acoustic model that is robust to additive noise, channel distortions, and the interference of competing talkers. A hybrid recurrent DNN-HMM system is proposed for robust acoustic modeling and achieves state-of-the-art performance on two benchmark datasets (Aurora-4, CHiME). To study the specific case of conversational speech recognition in the presence of a competing talker, several multi-style training setups for DNNs are investigated and a joint decoder operating on multi-talker speech is introduced. The proposed combined system outperforms the state-of-the-art 2006 IBM superhuman system on the same benchmark dataset.

Even with perfect ASR, extracting semantic notions from conversational speech can be challenging due to the interference of frequently uttered disfluencies, filler words, mispronounced words, and the like. The third goal is to propose a robust WFST-based semantic decoder that interfaces seamlessly with ASR. Latent semantic rational kernels (LSRKs) are proposed, and substantial topic-spotting performance gains are achieved on two conversational speech tasks (Switchboard, HMIHY0300).
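For orientation, the string-level MCE criterion that a non-uniform variant builds on can be written as follows. The keyword-dependent error cost ε(·) shown here is an assumption about the general form of the non-uniform weighting; the talk's exact cost function, and the smoothing parameters γ, θ, η, are not specified in this description.

    d_i(X_i; \Lambda) = -g_{y_i}(X_i; \Lambda)
        + \frac{1}{\eta} \log \Big[ \frac{1}{M-1} \sum_{j \neq y_i} \exp\big( \eta \, g_j(X_i; \Lambda) \big) \Big]

    \ell(d_i) = \frac{1}{1 + \exp(-\gamma d_i + \theta)}

    L(\Lambda) = \sum_i \epsilon(y_i) \, \ell\big( d_i(X_i; \Lambda) \big)

Here g_j is the discriminant (log-likelihood) of hypothesis j among M competitors, and ε(y_i) up-weights errors on keyword-bearing tokens; setting ε ≡ 1 recovers the uniform MCE objective.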
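In a hybrid DNN-HMM system, the network's state posteriors are converted to scaled likelihoods for HMM decoding by dividing out the state priors. Below is a minimal sketch of that standard step; the function and variable names are illustrative, not from the thesis.

    import numpy as np

    def scaled_log_likelihoods(posteriors, state_priors, floor=1e-10):
        """Standard hybrid recipe: log p(x|s) = log P(s|x) - log P(s) + const.

        posteriors:   (frames, states) DNN outputs P(s | x_t)
        state_priors: (states,) state priors estimated from training alignments
        """
        post = np.maximum(posteriors, floor)    # floor to avoid log(0)
        prior = np.maximum(state_priors, floor)
        return np.log(post) - np.log(prior)

The resulting matrix stands in for GMM log-likelihoods in a conventional Viterbi decoder; making the network recurrent changes how the posteriors are computed, not this conversion.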
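Multi-style (multi-condition) training amounts to corrupting clean training utterances with the interfering conditions expected at test time while keeping the clean-speech labels. A minimal sketch of mixing a competing talker at a target signal-to-interferer ratio follows; the SNR grid and the simple loop-to-length policy are assumptions, not the thesis's exact setup.

    import numpy as np

    def mix_at_snr(target, interferer, snr_db):
        """Scale the interfering waveform so the mixture has the requested
        target-to-interferer SNR, then add it to the target signal."""
        n = len(target)
        interferer = np.resize(interferer, n)   # loop/truncate to target length
        p_t = np.mean(target ** 2)
        p_i = np.mean(interferer ** 2) + 1e-12
        scale = np.sqrt(p_t / (p_i * 10.0 ** (snr_db / 10.0)))
        return target + scale * interferer

    # An assumed multi-style grid: each clean utterance is mixed with
    # competing speech at several SNRs, all sharing the clean labels.
    snr_grid_db = [6, 3, 0, -3, -6]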
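A rational kernel scores two weighted automata by transducer composition, K(x, y) = w[x ∘ T ∘ y]; latent semantic rational kernels additionally map n-gram statistics through a latent topic space before scoring. The sketch below is only a rough approximation of that idea, operating on a weighted N-best list instead of a full lattice and using a plain matrix in place of the WFST machinery; all names (expected_ngram_counts, topic_vectors, vocab_index) are illustrative.

    import numpy as np
    from collections import Counter

    def expected_ngram_counts(nbest, n=1):
        """Expected n-gram counts over a weighted N-best list, a crude
        stand-in for the lattice expectations a rational kernel computes
        by composition. nbest is a list of (posterior, word_list) pairs."""
        counts = Counter()
        for post, words in nbest:
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += post
        return counts

    def lsrk_score(nbest, topic_vectors, vocab_index):
        """Score topics with a latent-semantic weighting: map expected
        word counts through a term-to-topic matrix (e.g. learned by LSA)."""
        counts = expected_ngram_counts(nbest, n=1)
        x = np.zeros(len(vocab_index))
        for (w,), c in counts.items():
            if w in vocab_index:
                x[vocab_index[w]] = c
        return topic_vectors @ x   # one score per topic

Working on expected counts rather than the 1-best string is what lets the semantic decoder interface with ASR uncertainty rather than a single, possibly disfluent, transcript.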