Pig: Dataflow Programming for Map-Reduce Clusters

Published September 6, 2016, 17:32
There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies which routinely process petabytes. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative SQL style unnatural. The success of the more procedural map-reduce programming model, and of its scalable implementations on commodity hardware, is evidence of this preference. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. In this talk I will describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. Pig is used extensively at Yahoo! and is available to the public as an open-source Apache incubator project.
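To give a flavor of the "sweet spot" the talk describes, here is a minimal sketch of a Pig Latin script. The input file 'urls.txt' and its fields (url, category, pagerank) are hypothetical, not taken from the talk; the script keeps high-pagerank pages and computes the average pagerank per category.

    -- Hypothetical input: tab-delimited (url, category, pagerank) records.
    urls      = LOAD 'urls.txt' AS (url:chararray, category:chararray, pagerank:double);
    -- Keep only pages with a sufficiently high pagerank.
    good_urls = FILTER urls BY pagerank > 0.2;
    -- Group the surviving pages by category.
    grouped   = GROUP good_urls BY category;
    -- Compute the average pagerank within each category.
    stats     = FOREACH grouped GENERATE group AS category, AVG(good_urls.pagerank) AS avg_pagerank;
    STORE stats INTO 'category_stats';

Each statement names an intermediate dataflow step, which gives the language its procedural feel, while Pig compiles the whole script into physical plans executed as map-reduce jobs over Hadoop.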