Random Forests and the Data Sparseness Problem in Language Modeling

In this talk, we explore the use of Random Forests (RFs) in language modeling, the problem of predicting the next word based on the words already seen. The goal of this work is to develop a new language model smoothing technique based on randomly grown Decision Trees (DTs) and interpolated Kneser-Ney smoothing. This technique aims to address the data sparseness problem in language modeling and is complementary to many existing techniques. We study our RF approach in the context of n-gram language modeling. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories are long (more than four words). We show that our RF language models outperform interpolated Kneser-Ney n-gram models, reducing both perplexity (PPL) and word error rate (WER) in large-vocabulary speech recognition systems. The technique developed in this work is general: we will show that it works well when combined with other techniques, including word clustering and the structured language model (SLM).
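As a rough sketch of the two ingredients named above (the notation here is ours, not taken from the talk: M is the number of trees, \Phi_j the j-th tree's history-clustering function, C(\cdot) a count, D a discount, and \lambda a back-off weight), a random-forest language model averages the predictions of M randomly grown decision trees, each of which maps a history to one of its equivalence classes:

$$ P_{\mathrm{RF}}(w_i \mid w_{i-n+1}^{i-1}) \;=\; \frac{1}{M} \sum_{j=1}^{M} P_{\mathrm{DT}_j}\!\left(w_i \mid \Phi_j(w_{i-n+1}^{i-1})\right), $$

while interpolated Kneser-Ney smoothing, the baseline the RF models are compared against, discounts each observed n-gram count and interpolates with the lower-order distribution:

$$ P_{\mathrm{KN}}(w_i \mid w_{i-n+1}^{i-1}) \;=\; \frac{\max\!\left(C(w_{i-n+1}^{i}) - D,\ 0\right)}{C(w_{i-n+1}^{i-1})} \;+\; \lambda(w_{i-n+1}^{i-1})\, P_{\mathrm{KN}}(w_i \mid w_{i-n+2}^{i-1}). $$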