Learnable Similarity Functions and Their Applications in Information Integration and Clustering

Published September 6, 2016, 6:22
Pairwise similarity functions are ubiquitous in data mining and machine learning algorithms. Record linkage, clustering, nearest-neighbor search, and information retrieval are all tasks in which pairwise distance computations play a central role. Accuracy in these tasks depends critically on how well the similarity function captures the notion of likeness between objects in a given domain. It is therefore desirable to employ similarity functions that can adapt to the domain and task at hand. We demonstrate the benefits of learnable similarity functions on two tasks: record linkage and clustering. The goal of record linkage (also known as de-duplication and identity uncertainty) is to identify different database records that describe the same underlying entity. We introduce several learnable string distance functions based on probabilistic models, as well as an adaptive framework for combining them; both lead to significant accuracy improvements. The other task we consider is semi-supervised clustering, for which we present a probabilistic clustering framework based on Hidden Markov Random Fields that incorporates learnable similarity functions. Finally, we describe how learning similarity functions allows record linkage and clustering methods to scale efficiently to large datasets.
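The abstract does not spell out the models used, so the following is only a rough illustration of what a "learnable" similarity function for record linkage could look like. It is not the method presented in the talk. It assumes two fixed string measures (normalized edit similarity and token Jaccard overlap) and learns how to weight them with plain logistic regression from labeled duplicate and non-duplicate pairs. All names (`edit_similarity`, `token_jaccard`, `train`, `TOY_PAIRS`) and the toy data are hypothetical.

```python
import math


def edit_similarity(a: str, b: str) -> float:
    """Levenshtein distance rescaled to a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated, lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def features(a: str, b: str) -> list[float]:
    # Two base similarities plus a constant bias term.
    return [edit_similarity(a.lower(), b.lower()), token_jaccard(a, b), 1.0]


def learned_similarity(weights: list[float], a: str, b: str) -> float:
    """Estimated probability that the two strings name the same entity."""
    z = sum(w * x for w, x in zip(weights, features(a, b)))
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, epochs: int = 2000, lr: float = 0.5) -> list[float]:
    """Fit the combination weights by stochastic gradient ascent
    on the logistic log-likelihood."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for a, b, label in pairs:
            x = features(a, b)
            p = learned_similarity(weights, a, b)
            for k in range(len(weights)):
                weights[k] += lr * (label - p) * x[k]
    return weights


# Made-up training pairs: label 1 = same entity, 0 = different entities.
TOY_PAIRS = [
    ("Intl. Business Machines", "International Business Machines", 1),
    ("J. Smith", "John Smith", 1),
    ("Univ. of Texas", "University of Texas", 1),
    ("John Smith", "Jane Doe", 0),
    ("University of Texas", "Microsoft Research", 0),
    ("Apple Inc.", "Austin Texas", 0),
]

weights = train(TOY_PAIRS)
for a, b, label in TOY_PAIRS:
    print(f"{learned_similarity(weights, a, b):.2f}  label={label}  {a!r} / {b!r}")
```

The point of the sketch is only that the weights come from labeled data rather than being fixed by hand, so the same code adapts to a different domain when given different training pairs. Learning the parameters of the string distances themselves, as the abstract describes, would go one step further.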