New Results in the Theory of Clustering

Published 12 August 2016, 2:06
The k-means algorithm is still the most popular clustering algorithm in practice. Although it tends to perform very well, it does not give any approximation guarantee. One way to work around this theoretical shortcoming is to devise a clever seeding technique (choosing the k centers that k-means starts with) that is quick and comes with an approximation guarantee. Since the k-means algorithm only ever improves the solution, the resulting clustering will hopefully be good in practice, and there will be a theoretical guarantee of closeness to the optimal clustering (coming solely from the seeding algorithm).

There are many seeding algorithms in the clustering literature. One technique that has gained prominence in the recent past is the k-means++ algorithm. This is a very simple sampling-based algorithm that gives an O(log k) approximation in expectation. Even though it is extremely simple to define, many aspects of this randomized algorithm are not yet fully understood. For instance: does the algorithm give an O(1) approximation with probability that is not too small, i.e. at least 1/poly(k)? Does it behave nicely (in the above sense) when the clusters are appropriately 'separated'? How does the algorithm behave on low-dimensional data? Do the results for the squared Euclidean distance measure extend to other distance measures used in practice? Can this simple sampling-based idea be used to construct a PTAS?

In this talk, we will explore these issues and see answers to a few of these questions. Joint work with Sandeep Sen (IITD), Amit Kumar (IITD), and Nitin Garg (IITD).
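The k-means++ seeding discussed above is standardly described as D² sampling: pick the first center uniformly at random, then pick each subsequent center with probability proportional to a point's squared distance to its nearest already-chosen center. A minimal sketch of that seeding step (the function name and use of NumPy are my own illustration, not from the talk):

```python
import numpy as np

def kmeans_pp_seed(X, k, rng=None):
    """D^2 sampling: pick k seed centers from the rows of X (shape n x d)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    centers = [X[rng.integers(n)]]  # first center: uniform over the data
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen center
        diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)
        # sample the next center with probability proportional to d2
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```

These k seeds would then be handed to ordinary Lloyd-style k-means iterations; the O(log k) expected approximation guarantee mentioned above holds already for the seeds themselves, before any k-means refinement.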