Adding Domain Knowledge to Latent Topic Models

Published August 17, 2016, 2:31
Around the turn of the century, a favorite pastime in machine learning was to inject various forms of domain knowledge into clustering. Examples include must-links, where two items must be placed in the same cluster, and cannot-links, where they must not. Collectively known as constrained clustering, these methods produced clusters that were more relevant to domain experts. Fast forward a decade, and a new favorite pastime is to inject various forms of domain knowledge into Latent Dirichlet Allocation. The goal is to constrain the latent topic assignment of each word, so that latent topic modeling is informed by both data and domain knowledge, and the resulting topics are more relevant to domain experts. We present several examples from our group's work, ranging from simple topic-in-set knowledge, where the latent topic of a word is constrained to a small set of candidate topics; to the Dirichlet Forest prior, which allows must-links and cannot-links on topics while maintaining conjugacy for efficient inference; to a general framework named Fold.all. Fold.all allows domain experts to express arbitrary knowledge in human-friendly First-Order Logic and combines it with data using stochastic optimization. This approach enables domain experts to focus on high-level modeling goals instead of the low-level issues involved in creating a custom topic model.
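To make the topic-in-set idea concrete, below is a minimal sketch of collapsed Gibbs sampling for LDA in which the conditional distribution over a word's topic is masked to its allowed candidate set. The toy corpus, the variable names, and the example constraints in `ALLOWED` are illustrative assumptions, not taken from the talk.

```python
# A minimal sketch of topic-in-set knowledge in collapsed Gibbs sampling for LDA.
# All names (ALLOWED, n_dt, n_wt, ...) and the toy data are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids.
docs = [[0, 1, 2, 1], [2, 3, 3, 0]]
V, K = 4, 3              # vocabulary size, number of topics
alpha, beta = 0.1, 0.01  # symmetric Dirichlet hyperparameters

# Topic-in-set knowledge: word id -> set of allowed topics.
# Words not listed here may take any topic (example constraints).
ALLOWED = {1: {0}, 3: {1, 2}}

# Count tables and a constraint-respecting initial assignment.
n_dt = np.zeros((len(docs), K))   # document-topic counts
n_wt = np.zeros((V, K))           # word-topic counts
n_t = np.zeros(K)                 # per-topic totals
z = []
for d, doc in enumerate(docs):
    zd = []
    for w in doc:
        t = rng.choice(sorted(ALLOWED.get(w, range(K))))
        zd.append(t)
        n_dt[d, t] += 1; n_wt[w, t] += 1; n_t[t] += 1
    z.append(zd)

for _ in range(200):                      # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]                   # remove the current assignment
            n_dt[d, t] -= 1; n_wt[w, t] -= 1; n_t[t] -= 1
            # Standard collapsed-Gibbs conditional, then zero out
            # topics outside the word's allowed set before resampling.
            p = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_t + V * beta)
            mask = np.zeros(K)
            mask[list(ALLOWED.get(w, range(K)))] = 1.0
            p *= mask
            t = rng.choice(K, p=p / p.sum())
            z[d][i] = t
            n_dt[d, t] += 1; n_wt[w, t] += 1; n_t[t] += 1
```

The masking step is the entire change relative to ordinary collapsed Gibbs sampling: the conditional is computed as usual and then renormalized over the candidate set, so unconstrained words are sampled exactly as in standard LDA.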