Resource-Efficient Redundancy for Large-Scale Data Processing and Storage Systems

Published September 26, 2019, 17:37
Large-scale systems often operate under non-ideal conditions such as failures, stragglers, and load imbalance. These issues increase query latency in data-processing systems and degrade durability and access latency in storage systems. Redundancy (duplication of data and/or queries) is a common way to impart resilience against such effects. In this talk, I will present two sets of results that take fundamentally new approaches to adding redundancy in data-processing and storage systems, blending tools from coding theory and machine learning with systems insights:

(1) A novel learning-and-coding-based resilient computation framework and its application to reducing tail latency when serving neural network models for tasks such as image classification, speech recognition, and object detection. Our solution is the first to overcome a challenging barrier that restricted existing coding-based resilient computation approaches to a severely limited class of functions.

(2) A new redundancy-configuration approach for large-scale storage systems that exploits reliability heterogeneity among storage devices to achieve significant cost savings. Our solution challenges the widely used static approach to configuring redundancy by proposing a dynamic, data-driven approach that tailors redundancy levels to observed failure rates. Using a production data set, we show an 11-16% reduction in storage space even in highly optimized erasure-coded storage systems, which translates to significant cost savings in large-scale operations.
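
To illustrate the idea in (2) of tailoring redundancy to observed failure rates, here is a minimal sketch. It is not the method from the talk. It assumes a simplified model in which each chunk of a (k data + r parity) erasure-coded stripe fails independently with probability p, and it picks the scheme with the lowest storage overhead that still meets a loss-probability target. The scheme list, failure probabilities, and target are all hypothetical values chosen for illustration.

```python
from math import comb

def stripe_loss_probability(k, r, p):
    """Probability that more than r of the k + r chunks fail.

    Assumes independent chunk failures with probability p
    (a simplifying assumption, not a model from the talk).
    """
    n = k + r
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(r + 1, n + 1))

def storage_overhead(k, r):
    """Raw bytes stored per byte of user data for a (k, r) scheme."""
    return (k + r) / k

def cheapest_scheme(schemes, p, target):
    """Lowest-overhead scheme whose loss probability is within target, or None."""
    feasible = [(k, r) for k, r in schemes
                if stripe_loss_probability(k, r, p) <= target]
    if not feasible:
        return None
    return min(feasible, key=lambda s: storage_overhead(*s))

# Hypothetical candidate schemes and reliability target.
SCHEMES = [(6, 3), (10, 4), (12, 3), (20, 4)]
TARGET = 1e-5

# Hypothetical per-chunk failure probabilities for two device groups.
for label, p in [("more reliable devices", 0.01), ("less reliable devices", 0.02)]:
    scheme = cheapest_scheme(SCHEMES, p, TARGET)
    if scheme is None:
        print(f"{label}: no candidate scheme meets the target")
    else:
        print(f"{label}: scheme {scheme}, overhead {storage_overhead(*scheme):.2f}x")
```

In this toy model, a static configuration would have to use the scheme that is safe for the least reliable devices everywhere, while a per-group choice lets more reliable devices use a lower-overhead scheme.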

Talk slides: microsoft.com/en-us/research/u...

Learn more about this and other talks at Microsoft Research: microsoft.com/en-us/research/v...