DataPath: A Data-Centric Analytic Processing Engine

Published August 17, 2016, 21:31
In this talk I will cover in detail DataPath, a database system designed from the ground up to process analytical queries on 1-10 TB workloads on hardware in the sub-$100,000 range. DataPath has a number of unique features. First, it is a data-centric, as opposed to compute-centric, engine: the flow of data, rather than the computation, is the primary concern. Throughout the memory hierarchy, once data is accessed, as much processing as possible is performed on it. Second, DataPath relies heavily on fast linear scans. By accessing disks in parallel, DataPath is capable of scanning data at more than 2 GB/s on $20,000 hardware. Third, DataPath relies on on-the-fly C++ code generation. By compiling SQL queries to low-overhead C++ code, the execution engine can keep up with the fast I/O and process close to 100 million tuples per second. By consolidating the processing that multiple queries require around the data accesses, DataPath gives excellent performance on compute-intensive tasks like Q1 in TPC-H. One main focus of the talk will be the interplay between the hardware, the operating system, and the database code, and how a careful end-to-end design can ensure good performance throughout. This is work in collaboration with Chris Jermaine at Rice University and students at both UF and Rice.
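To make the data-centric idea concrete, here is a minimal sketch of a shared linear scan in which each tuple is read once and every active query does its work on it before the scan moves on. It is an illustration only, not DataPath's code: the names Tuple, Query, and sharedScan are hypothetical, the row layout is only loosely modeled on a TPC-H lineitem row, and std::function stands in for the specialized C++ that the abstract says is generated per query.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical row layout (assumption), loosely modeled on TPC-H lineitem.
struct Tuple {
    std::int64_t quantity;
    double price;
    double discount;
};

// Hypothetical query description (assumption): a filter plus an expression
// whose value is summed over all matching tuples.
struct Query {
    std::function<bool(const Tuple&)> predicate;
    std::function<double(const Tuple&)> expression;
    double sum = 0.0;
    std::int64_t matched = 0;
};

// Data-centric execution: scan the chunk once and, while each tuple is in
// hand, perform the work of every registered query on it.
void sharedScan(const std::vector<Tuple>& chunk, std::vector<Query>& queries) {
    for (const Tuple& t : chunk) {
        for (Query& q : queries) {
            if (q.predicate(t)) {
                q.sum += q.expression(t);
                ++q.matched;
            }
        }
    }
}
```

A compute-centric engine would instead run each query as its own pass over the data. With the loop order shown here, adding a query adds per-tuple work but no additional scan, which is the sense in which processing is consolidated around the data accesses.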