Collaborative, Large-Scale Data Analytics and Visualization with Python

Published July 27, 2016, 2:18
NumPy and, more recently, Pandas have made Python ubiquitous for scientific computing and data analytics. The Python technical stack works very well for a wide variety of problems that fit in a single address space (the RAM of a single computer). For problems that require larger data sets, current approaches use memory-mapped files, MPI, IPython parallel, and/or a standard map-reduce system such as Disco or Hadoop. These techniques typically complicate the software solution significantly, moving it away from the simple array-oriented (or table-oriented) expression that makes NumPy (and Pandas) so powerful and popular. They can also cause significant data movement throughout the memory hierarchy, which is the common bottleneck in data-centric computing today. Blaze is an array/table library for Python that can be used to manage and manipulate very large, disjoint data sets in an array-oriented fashion. It is built on a C++ library (dynd) that provides dynamic, multi-dimensional arrays with flexible data types. It also leverages Numba, an array-oriented Python compiler that translates a subset of Python syntax to LLVM IR and optimized machine code. In this talk I will discuss the design and roadmap of Blaze and Numba. I will also give an overview and example of web-based visualization with Bokeh, which lets Python developers easily produce interactive, web-based visualizations, leading into an overview of Wakari, which provides easy access to executable IPython notebooks in the cloud.
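As a rough illustration of the memory-mapped-file approach mentioned above, here is a minimal sketch (not taken from the talk) that uses NumPy's `np.memmap` to compute a mean over a binary file one chunk at a time. The function name `chunked_mean`, the file name `values.bin`, and the chunk size are assumptions made for this example. It also hints at the complication the abstract describes: a one-line `values.mean()` turns into an explicit loop once the data is treated as too large for RAM.

```python
import os
import tempfile

import numpy as np


def chunked_mean(path, n_items, chunk_size=1_000_000, dtype=np.float64):
    """Mean of a flat binary file, read through a memory map in chunks.

    Hypothetical helper for illustration; the file is assumed to hold
    n_items values of the given dtype and nothing else.
    """
    data = np.memmap(path, dtype=dtype, mode="r", shape=(n_items,))
    total = 0.0
    for start in range(0, n_items, chunk_size):
        # Only the pages backing this slice need to be read from disk.
        total += float(data[start:start + chunk_size].sum())
    return total / n_items


# Tiny demo data set standing in for a file too large to load at once.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.bin")
    values = np.arange(10, dtype=np.float64)
    values.tofile(path)

    result = chunked_mean(path, n_items=values.size, chunk_size=4)
    in_memory = float(values.mean())  # the simple array-oriented expression
```

Both `result` and `in_memory` hold the same mean here; the difference is that the chunked version never needs the whole array resident in memory at once.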