Coping with Uncertain Data: Multi-Source Integration and Fuzzy Lookups

Published August 17, 2016, 1:11
This talk presents two separate pieces of work that share a common challenge: dealing with uncertainty in data.

In the first part of the talk, we address the problem of integrating multiple sources of uncertain data. As an extremely simple motivating example, one image database may label an image as blue or green, while another source labels the same image as green or yellow. As a result of combining information from the two sources, green may be deemed more likely than the other two colors. We will discuss how integration of uncertain data, through both contradiction and corroboration, can yield a more certain result than any of the sources individually. Specifically, we tackle the local-as-view setting of data integration where each source database may be an uncertain database. Our contributions include a new containment definition for uncertain databases, efficient representation and query answering techniques, and coping with inconsistent sources.

In the second part, we consider the problem of fuzzy lookups. Keyword search, data cleaning, and entity resolution all rely on efficient fuzzy lookups based on textual similarity functions. We introduce a notion of transformations into the lookup problem, enabling users to specify rules such as 'Robert - Bob' and 'Robert - Bert' that are incorporated into the matching process. We then motivate a similarity function called Jaccard containment, which is an error-tolerant version of set containment. Finally, we present algorithms that enable efficient Jaccard containment lookups in the presence of the uncertainty introduced by transformation rules.
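To make the fuzzy-lookup idea from the second part concrete, here is a minimal sketch in Python. It rests on several assumptions that the abstract does not state: Jaccard containment is taken to be the unweighted fraction of query tokens that also appear in the record, |Q ∩ R| / |Q|; transformation rules are applied token by token; and the best-scoring rewritten query wins. The names `jaccard_containment`, `expand_with_rules`, `best_containment`, and `RULES` are hypothetical. The brute-force enumeration of rewritten queries is only an illustration and is not the efficient lookup algorithm presented in the talk.

```python
from itertools import product


def jaccard_containment(query_tokens, record_tokens):
    """Fraction of distinct query tokens found in the record (assumed definition)."""
    q = set(query_tokens)
    r = set(record_tokens)
    if not q:
        return 1.0  # an empty query is trivially contained
    return len(q & r) / len(q)


def expand_with_rules(tokens, rules):
    """Enumerate every rewriting of the query under the transformation rules.

    Each token may stay as it is or be replaced by any right-hand side of a
    rule whose left-hand side matches it. This is exponential in the worst
    case, so it only suits tiny examples.
    """
    options = [[t] + sorted(rules.get(t, [])) for t in tokens]
    return [list(choice) for choice in product(*options)]


def best_containment(query, record, rules):
    """Highest containment score over all rewritings of the query."""
    record_tokens = record.lower().split()
    variants = expand_with_rules(query.lower().split(), rules)
    return max(jaccard_containment(v, record_tokens) for v in variants)


# Hypothetical rule table mirroring the abstract's 'Robert - Bob' and
# 'Robert - Bert' examples.
RULES = {"robert": {"bob", "bert"}}

# Without rules, only "smith" matches, so half of the query is contained.
score_plain = jaccard_containment("robert smith".split(), "bob smith seattle".split())

# With rules, the rewriting "bob smith" is fully contained in the record.
score_rules = best_containment("Robert Smith", "Bob Smith Seattle", RULES)
```

Containment, unlike plain Jaccard similarity, does not penalize the record for carrying extra tokens such as "seattle", which is why it suits lookups where a short query is matched against longer records. The rules add a second source of tolerance: each rewriting is a possible reading of the query, and that set of possible readings is the uncertainty a practical index would have to handle without enumerating every variant.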