Locating Data Sources in Large Distributed Systems

Published 7 September 2016, 16:02
The Internet is the fastest-growing information medium of all time. Recent estimates [1] put the size of the Internet at 533,000 TB. Excluding e-mail and instant messaging, the remaining 92,000 TB belong to the part of the web that users can access through their web browsers. This part of the web comprises the Surface Web and the Deep Web. The Surface Web accounts for just 170 TB and is the part of the web that search engines can index, and is therefore searchable. The other part, the Deep Web, is 400-550 times larger than the Surface Web. Deep Web data is usually stored in database back-ends that are inaccessible to search engines and cannot be downloaded into a central data warehouse for querying. Thus, the only way to run queries against the entire content of the Web is to perform distributed query processing on the Internet.

This talk focuses on locating data sources in large distributed systems, which is an essential component of Internet-scale distributed query processing and is also the focus of my thesis research. Two alternative designs are presented. The first part of the talk describes the Peer Index approach. Experiments using a real system prototype running on 100 and 200 nodes show that knowledge of data location across the data sources is essential for a scalable system. The second part of the talk presents the Distributed Catalog Service. Participating data sources act both as data repositories and as catalog service providers. A query issued on any of the participating data sources is sent to all data sources that have results for the query; data sources with no results for a given query are not contacted. The last part of the talk briefly introduces the problem of distributed aggregation, which is the third part of my thesis research.

[1] P. Lyman et al., "How much information?" School of Information Management and Systems, UC Berkeley, sims.berkeley.edu/how-much-inf...
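The abstract describes the Distributed Catalog Service only at a high level. As a rough illustration of the routing idea it names (forward a query only to data sources that can contribute results, skipping the rest), here is a minimal Python sketch. The names CatalogNode, register_source, and route_query are hypothetical and are not taken from the talk; the actual system is distributed across the participating data sources rather than a single in-memory index.

```python
from collections import defaultdict

class CatalogNode:
    """Hypothetical sketch of a catalog entry point: maps query terms to the
    data sources known to hold matching data, so a query is forwarded only
    to sources that can actually contribute results."""

    def __init__(self):
        # term -> set of source identifiers that advertise data for it
        self.term_index = defaultdict(set)

    def register_source(self, source_id, terms):
        """A participating data source advertises the terms it covers."""
        for term in terms:
            self.term_index[term].add(source_id)

    def route_query(self, query_terms):
        """Return only the sources that may have results; others are never contacted."""
        candidates = set()
        for term in query_terms:
            candidates |= self.term_index.get(term, set())
        return candidates


# Usage: sources advertise their coverage; a query touches only relevant sources.
catalog = CatalogNode()
catalog.register_source("source_A", ["genomics", "proteins"])
catalog.register_source("source_B", ["astronomy"])
print(catalog.route_query(["genomics"]))  # {'source_A'} -- source_B is skipped
```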