Microsoft Research
Published September 6, 2016, 6:06
In the first part of this talk, I will present a spam filtering method based on statistical data compression models. The nature of these models allows them to be employed as Bayesian text classifiers that operate on character sequences. The models are fast to construct and can be updated incrementally. I will present experimental results indicating that this method performs well in comparison to established spam filters and that it is extremely robust to noise, which should make it difficult for spammers to defeat. I will also give some examples showing that the method is capable of picking up interesting, non-trivial patterns that are indicative of spam or ham (legitimate mail).

The second part of this talk describes how to exploit structural information for document categorization. Classifier stacking can be used to exploit the structure of semi-structured documents for improved text categorization performance. In this approach, a meta-classifier combines predictions based on different structural elements. I will show that this approach consistently outperforms a flat-text linear SVM on a number of standard text categorization datasets, often by a wide margin. I will present selected nomograms that visualize the resulting meta-classifier and give interesting insight into the characteristics of the datasets.
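To make the first idea concrete, here is a minimal sketch of a character-level classifier in the spirit of the abstract. It is not the method from the talk: the abstract does not name its compression models, so a fixed-order character model with add-one smoothing stands in for them. The names `CharModel`, `code_length`, and `classify`, the context order of 2, the assumed alphabet size of 256, and the toy messages are all illustrative assumptions. One model is trained per class, and a message is assigned to the class whose model would encode it in the fewest bits, which corresponds to a Bayesian decision under equal class priors.

```python
import math
from collections import defaultdict


class CharModel:
    """Order-k character model with add-one smoothing.

    A crude stand-in for an adaptive compression model (assumption:
    the talk's actual models are not specified in the abstract).
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed symbol count for smoothing
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> char -> count
        self.totals = defaultdict(int)  # context -> total count

    def update(self, text):
        """Incrementally add one training message to the model."""
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            ctx = padded[i - self.order:i]
            self.counts[ctx][padded[i]] += 1
            self.totals[ctx] += 1

    def code_length(self, text):
        """Bits needed to encode text under this model (its cross-entropy)."""
        padded = " " * self.order + text
        bits = 0.0
        for i in range(self.order, len(padded)):
            ctx = padded[i - self.order:i]
            # .get avoids creating entries in the defaultdicts while scoring
            count = self.counts.get(ctx, {}).get(padded[i], 0)
            total = self.totals.get(ctx, 0)
            prob = (count + 1) / (total + self.alphabet_size)
            bits -= math.log2(prob)
        return bits


def classify(models, text):
    """Return the label whose model encodes text in the fewest bits."""
    return min(models, key=lambda label: models[label].code_length(text))


# Toy training data (invented for illustration only).
models = {"spam": CharModel(), "ham": CharModel()}
for msg in ["buy cheap pills now!!!", "cheap pills, buy now"]:
    models["spam"].update(msg)
for msg in ["meeting moved to monday at noon", "please review the attached report"]:
    models["ham"].update(msg)

print(classify(models, "buy cheap pills"))
```

Because `update` only increments counts, new labeled messages can be folded in one at a time without retraining, which mirrors the incremental updating mentioned in the abstract. Working on raw characters rather than tokenized words is also what would let such a model tolerate deliberately misspelled or obfuscated words.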