ExPert: Pareto-Efficient Replicated Task EXecution

Published September 7, 2016, 17:56
Many large-scale distributed environments, also known as "clouds", "grids", or "batch systems", execute Bags-of-(millions of)-Tasks (BOTs). These include Google's Map-Reduce, Intel's NetBatch, and resource-demanding extreme e-Science computations. However, any large-scale distributed system incurs faults, and when executed on a non-dedicated system, tasks may also be preempted by higher-priority activities. This is one cause, for example, of the long-tail phenomenon of BOT execution, which may lengthen the turnaround time of a BOT computation by hundreds of percentage points. To reduce task turnaround time and to cope with environment unreliability, users employ task replication; however, replication wastes resources and raises non-trivial turnaround-cost trade-offs. Moreover, users usually have a choice of several environments with different reliability and cost characteristics, which forces them to make difficult and often suboptimal decisions about where to send (replicas of) their tasks for execution. To address this problem we introduce ExPert, a general framework, with associated algorithms and tools, for optimizing the turnaround-cost trade-offs of BOT execution in a mixture of environments. Our framework allows for the selection of a Pareto-efficient task replication strategy, subject to a user-specified utility function, thus minimizing the waste of budget (in terms of energy, cost, etc.) otherwise incurred by replication policies. We show through mathematical and trace-based analysis that, in realistic scenarios, users of our framework can expect a significant cost reduction (even an order of magnitude) with no performance loss.
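To make the selection step concrete, the sketch below illustrates the general idea of picking a Pareto-efficient replication strategy under a user-specified utility function. It is not the ExPert implementation: the strategy names, the (turnaround, cost) values, and the linear utility form are hypothetical, chosen only to mirror the turnaround-cost trade-off described above.

```python
# Illustrative sketch (not the ExPert code): among candidate replication
# strategies evaluated as (turnaround, cost) pairs, keep the Pareto-efficient
# ones and return the strategy preferred by a user-specified utility function.
from typing import Callable, Dict, Tuple

Point = Tuple[float, float]  # (turnaround in hours, cost in $) -- hypothetical units

def pareto_front(strategies: Dict[str, Point]) -> Dict[str, Point]:
    """Keep only strategies not dominated in both turnaround and cost."""
    front = {}
    for name, (t, c) in strategies.items():
        dominated = any(
            (t2 <= t and c2 <= c) and (t2 < t or c2 < c)
            for other, (t2, c2) in strategies.items() if other != name
        )
        if not dominated:
            front[name] = (t, c)
    return front

def select(strategies: Dict[str, Point],
           utility: Callable[[float, float], float]) -> str:
    """Among Pareto-efficient strategies, return the one minimizing the utility."""
    front = pareto_front(strategies)
    return min(front, key=lambda name: utility(*front[name]))

# Hypothetical strategies mixing a cheap unreliable pool with a costly
# reliable one; the user weights turnaround three times more than cost.
candidates = {
    "no-replication-cheap": (12.0, 10.0),
    "2x-replication-cheap": (7.0, 20.0),
    "3x-replication-cheap": (7.5, 30.0),   # dominated: slower and costlier than the mixed option
    "no-replication-reliable": (6.0, 40.0),
    "2x-replication-mixed": (6.5, 25.0),
}
best = select(candidates, utility=lambda t, c: 3.0 * t + c)
print(best)  # -> "2x-replication-cheap" under this particular utility
```

In this toy setup the dominated strategy is filtered out first, and the utility function only ever chooses among the remaining Pareto-efficient options, which is the property that prevents budget being wasted on strategies that are worse in both dimensions.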