Joint Cluster Analysis of Attribute Data and Relationship Data: Problems, Algorithms & Applications

147
Опубликовано 7 сентября 2016, 16:48
Attribute data and relationship data are two principle types of data, representing the intrinsic and extrinsic properties of entities. While attribute data has been the main source of data for cluster analysis, relationship data such as social networks or metabolic networks are becoming increasingly available. In many cases these two data types carry complementary information, which calls for a joint cluster analysis of both data types in order to achieve more natural clusterings. For example, when identifying research communities, relationship data could represent co-author relationships and attribute data could represent the research interests of scientists. Communities could then be identified as clusters of connected scientists with similar research interests. Our introduction of joint cluster analysis is part of a recent, broader trend to consider as much background information as possible in the process of cluster analysis, and in general, in data mining. In this talk, we briefly review related work including constrained clustering, semi-supervised clustering and multi-relational clustering. We then propose the Connected k-Center (CkC) problem, which aims at finding k connected clusters minimizing the radius with respect to the attribute data. We sketch the main ideas of the proof of NP-completeness and present a constant factor approximation algorithm for the CkC problem. Since this algorithm does not scale to large datasets, we have also developed NetScan, a heuristic algorithm that is efficient for large, real databases. We report experimental results from two applications, community identification and document clustering, both based on DBLP data. Our experiments demonstrate that NetScan finds clusters that are more meaningful and accurate than the results of existing algorithms. We conclude the talk with other promising applications and new problems of joint cluster analysis. In particular, we discuss the clustering of gene expression data and the hotspot analysis of crime data as well as a joint cluster analysis problem that does not require the user to specify the number of clusters in advance.
автотехномузыкадетское