Data Storage and Processing in Parallel Array Engines

Published June 21, 2016, 20:08
Scientists today can generate data at unprecedented scale and rate. For example, the Large Synoptic Survey Telescope (LSST) project has announced that it will produce approximately 30 TB of data per night within a few years. In many fields of science, multidimensional arrays rather than flat tables are the standard data type, because data values are associated with coordinates in space and time. For example, images in astronomy are 2D arrays of pixel intensities, and climate and ocean models use arrays or meshes to describe 3D regions of the atmosphere and oceans. As a result, scientists need powerful tools to help them manage these massive arrays. In this talk, I will focus on the challenges of building parallel array data management systems that facilitate massive-scale data analytics over arrays. In particular, I will present the AscotDB system, the product of a collaboration among an interdisciplinary team of astronomy and database experts. Our goal is to answer one question: what would be the most transformative tool for processing next-generation telescope image collections, such as the one that LSST will produce? In AscotDB, we integrated three pieces of technology: the SciDB open-source array engine for data storage and processing, Ascot for graphical data exploration, and Python for easy programmatic access. Together they provide a compelling and powerful environment for the exploration, analysis, visualization, and sharing of large astronomical datasets. In the context of the AscotDB project, and motivated by other array-processing applications, I will describe some of the critical challenges in building a parallel array management system and how we addressed them. In particular, I will present three major components of an array engine that I tackled during my Ph.D., in the context of the following projects: 1) ArrayStore: efficient storage management mechanisms for storing arrays on disk; 2) TimeArr: efficient support for updates and data versioning; 3) ArrayLoop: native support for efficient iterative computations.