Failures and Other Challenges of Big-Data Analytics

Published July 28, 2016, 0:23
An important challenge faced by today's big-data analytics systems is fault tolerance: when running a parallel query at large scale, some form of failure is likely to occur during execution. Existing systems typically take one of two radically different strategies to handle failures: restart entire queries, or materialize the output of each operator and restart only failed operator partitions. The former approach adds significant overhead when a failure occurs, while the latter adds overhead at runtime and typically introduces global synchronization barriers. In this talk, we present FTOpt, a new approach for making online, parallel query plans fault-tolerant: FTOpt provides intra-query fault tolerance without blocking. It does so by using different fault-tolerance techniques at different operators within a query plan. Enabling each operator to use a different fault-tolerance strategy leads to a space of fault-tolerance plans amenable to cost-based optimization. FTOpt comprises a protocol for mixing and matching fault-tolerance techniques within a single query plan and an optimizer that selects the technique to use at each operator in order to minimize the expected processing time with failures for the entire query. Experiments show that with as little as one failure, the choice of fault-tolerance approach can result in a 70% difference in query runtimes, that hybrid query plans often lead to the best performance, and that our optimizer is able to select a winning plan. In addition to FTOpt, we will also present a broad overview of other research challenges tackled by the ongoing Nuage, CQMS, and Data Eco$y$tem projects at the University of Washington.
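To make the idea of cost-based selection over fault-tolerance plans concrete, here is a minimal Python sketch. It is not FTOpt's actual cost model or protocol. It assumes a hypothetical linear pipeline whose operators run one after another, only two per-operator strategies (`"none"` and `"materialize"`, loosely mirroring the two extremes described above), and exactly one failure that hits an operator with probability proportional to its runtime and forces a rerun from the nearest upstream materialization point. All names and numbers are illustrative.

```python
from itertools import product

NONE, MATERIALIZE = "none", "materialize"  # hypothetical strategy labels


def expected_runtime(op_times, mat_overheads, plan):
    """Expected runtime of a linear pipeline under a toy single-failure model.

    op_times[i]      -- failure-free runtime of operator i
    mat_overheads[i] -- extra time to materialize operator i's output
    plan[i]          -- NONE or MATERIALIZE for operator i
    """
    # Failure-free cost: operator work plus materialization overhead.
    base = sum(
        t + (o if s == MATERIALIZE else 0)
        for t, o, s in zip(op_times, mat_overheads, plan)
    )
    total = sum(op_times)
    expected_redo = 0.0
    for k, t_k in enumerate(op_times):
        # Find the nearest upstream operator whose output was materialized;
        # recovery restarts right after it (or at the query start).
        start = 0
        for j in range(k - 1, -1, -1):
            if plan[j] == MATERIALIZE:
                start = j + 1
                break
        # Pessimistic assumption: all work from `start` through k is redone.
        redo = sum(op_times[start:k + 1])
        expected_redo += (t_k / total) * redo
    return base + expected_redo


def best_plan(op_times, mat_overheads):
    """Exhaustively pick the plan with the lowest expected runtime."""
    candidates = product([NONE, MATERIALIZE], repeat=len(op_times))
    return min(candidates, key=lambda p: expected_runtime(op_times, mat_overheads, p))


if __name__ == "__main__":
    times, overheads = [10, 10, 10], [1, 1, 1]
    plan = best_plan(times, overheads)
    print(plan, expected_runtime(times, overheads, plan))
```

Under these assumed numbers, the cheapest plan materializes the first two operators but not the last, a hybrid that beats both "restart everything" and "materialize everywhere". Exhaustive enumeration is exponential in the number of operators and is used here only to keep the sketch short.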