Semi-supervised Clustering: Probabilistic Models, Algorithms and Experiments

Published September 6, 2016, 5:30
Clustering is one of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. The focus of our research is on semi-supervised clustering, where we study how prior knowledge can be incorporated into clustering algorithms. We present probabilistic models for semi-supervised clustering, develop algorithms based on these models, and empirically validate their performance through extensive experiments on datasets from different domains (e.g., text and web data, handwritten character recognition, and bioinformatics). In many domains where clustering is applied, prior knowledge is naturally available in the form of constraints on some of the instances, specifying whether two instances should be in the same cluster or in different clusters. We focus in particular on the problem of semi-supervised clustering with such pairwise constraints. We show that this problem has a well-defined underlying probabilistic model, a Hidden Markov Random Field, and we give convergence guarantees for our algorithm for a large class of clustering distortion measures (e.g., squared Euclidean distance, KL divergence, and cosine distance). We propose an active learning algorithm for acquiring maximally informative pairwise constraints in an interactive query-driven framework, which to our knowledge is the first active learning algorithm for constrained semi-supervised clustering. Apart from constrained clustering, we will also discuss other interesting problems of semi-supervised clustering in this talk (e.g., using prior knowledge in the form of category labels on data instances during clustering, incorporating prior knowledge into overlapping clustering of data, and semi-supervised graph partitioning using a kernel approach).
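To make the idea of clustering with pairwise constraints concrete, here is a minimal sketch of a penalty-based constrained k-means. It is not the algorithm presented in the talk: it assumes squared Euclidean distortion, a single uniform penalty weight `w` for every violated constraint, and a greedy point-by-point assignment step. The function name `constrained_kmeans` and all of its parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np

def constrained_kmeans(X, k, must_link, cannot_link, w=1.0, n_iter=20, seed=0):
    """Illustrative penalty-based constrained k-means (assumed, simplified).

    Each point is assigned to the cluster that minimizes its squared
    Euclidean distance to the center plus a penalty w for every pairwise
    constraint the assignment would violate.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialize centers from k distinct data points.
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    labels = np.full(n, -1)  # -1 means "not assigned yet"

    # Neighbor lists so each point can look up its constraints quickly.
    ml = [[] for _ in range(n)]
    cl = [[] for _ in range(n)]
    for i, j in must_link:
        ml[i].append(j)
        ml[j].append(i)
    for i, j in cannot_link:
        cl[i].append(j)
        cl[j].append(i)

    for _ in range(n_iter):
        changed = False
        # Assignment step: greedy, one point at a time.
        for i in range(n):
            cost = ((X[i] - centers) ** 2).sum(axis=1)
            for j in ml[i]:
                if labels[j] >= 0:
                    # Penalize every cluster except the partner's cluster.
                    cost += w
                    cost[labels[j]] -= w
            for j in cl[i]:
                if labels[j] >= 0:
                    # Penalize joining the partner's cluster.
                    cost[labels[j]] += w
            new_label = int(np.argmin(cost))
            if new_label != labels[i]:
                labels[i] = new_label
                changed = True
        # Update step: recompute each non-empty cluster's mean.
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
        if not changed:
            break
    return labels, centers

# Toy data: two small groups with a few pairwise constraints.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels_demo, centers_demo = constrained_kmeans(
    X_demo, k=2, must_link=[(0, 1), (2, 3)], cannot_link=[(0, 2)], w=100.0
)
print(labels_demo)
```

A large `w` makes the constraints behave almost like hard requirements, while a small `w` lets the distortion term override them. A full probabilistic treatment would replace the uniform penalty and the greedy assignment with the model's own potentials and inference procedure.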