Robust Semi-Supervised Learning

2 989
41.5
Опубликовано 17 августа 2016, 1:51
Semi-supervised learning algorithms are designed to learn an unknown concept from a partially-labeled data set of training examples. They are widely popular in practice, since labels are often very costly to obtain. This talk is about a new approach to semi-supervised learning that addresses a mismatch between the way semi-supervised learning algorithms have been developed and the way they are commonly used. Most existing semi-supervised learning algorithms are analyzed under the assumption that the algorithm can randomly select a subset of unlabeled training examples and submit them to a labeler. But in many applications of semi-supervised learning, the partially-labeled data is `naturally occurring', and the learning algorithm has no control over which examples are labeled. This is particularly true of data generated by web users. Websites like Facebook, Youtube and Flikr give users the option to label, or 'tag', images, videos and other content, a process that generates very large partially-labeled data sets. We have little understanding of how users select which items to label, but we know that they almost certainly are not selecting them randomly. Instead of assuming that labels are missing at random, we analyze a less favorable scenario where the label information can be missing partially and arbitrarily. We present nearly matching upper and lower generalization bounds for learning in this setting under reasonable assumptions about available label information. Motivated by the analysis, we formulate a convex optimization problem for parameter estimation, derive an efficient algorithm suitable for large data sets, and analyze its convergence. Our algorithm can be viewed as a convex formulation of existing nonconvex approaches to semi-supervised learning, such as posterior regularization, and therefore is less sensitive to local minima. In experiments on several image data sets, we show that our algorithm performs much better than existing semi-supervised learning algorithms under a number of challenging but realistic labeling scenarios. This is joint work with Ben Taskar.
автотехномузыкадетское