Effective Data Versioning for Collaborative Data Science

Published March 25, 2019, 19:21
As more people in every organization and team do data science, dataset versions proliferate at every stage of data analysis. More often than not, these versions are stored ad hoc in shared file systems, which leads to massive redundancy and duplication and makes it impossible to keep track of or find a specific version. In the first part of this talk, I will present OrpheusDB, a tool we developed for effective versioning of structured data. OrpheusDB enables true collaboration through Git-style commands and supports reasoning about versions through a rich syntax of SQL-like statements. It can compactly store, keep track of, and retrieve versions on demand. In the second part of the talk, I will focus on one of our attempts at general-purpose data versioning, which removes the assumptions built into OrpheusDB. Specifically, I will describe a generalized storage representation for arbitrary data formats that enables compact storage while maintaining fast version retrieval.
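To make the idea of compact storage with on-demand retrieval concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not OrpheusDB's actual interface or storage layout: the class `ToyVersionStore` and its methods `commit`, `checkout`, and `diff` are hypothetical names chosen for this example. The sketch stores each distinct record once and represents a version as a list of record ids, so rows shared between versions are not duplicated.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVersionStore:
    """Hypothetical toy store: each distinct record is kept once,
    and a version is just a list of record ids."""
    records: dict = field(default_factory=dict)     # record id -> record tuple
    record_ids: dict = field(default_factory=dict)  # record tuple -> record id
    versions: dict = field(default_factory=dict)    # version id -> (parent id, record ids)

    def commit(self, rows, parent=None):
        """Save rows as a new version and return its version id."""
        ids = []
        for row in rows:
            row = tuple(row)
            if row not in self.record_ids:
                # First time this exact record is seen: store it once.
                rid = len(self.records) + 1
                self.record_ids[row] = rid
                self.records[rid] = row
            ids.append(self.record_ids[row])
        vid = len(self.versions) + 1
        self.versions[vid] = (parent, ids)
        return vid

    def checkout(self, vid):
        """Materialize the full table for one version."""
        _parent, ids = self.versions[vid]
        return [self.records[rid] for rid in ids]

    def diff(self, old, new):
        """Return (added, removed) records between two versions."""
        old_ids = set(self.versions[old][1])
        new_ids = set(self.versions[new][1])
        added = [self.records[rid] for rid in sorted(new_ids - old_ids)]
        removed = [self.records[rid] for rid in sorted(old_ids - new_ids)]
        return added, removed


store = ToyVersionStore()
v1 = store.commit([("alice", 30), ("bob", 25)])
v2 = store.commit([("alice", 30), ("bob", 26), ("carol", 41)], parent=v1)

print(store.checkout(v1))
print(store.diff(v1, v2))
print(len(store.records))  # distinct records stored across both versions
```

In this sketch the two versions together contain five rows, but the unchanged row is stored only once. A real system would also have to weigh storage savings against the cost of reconstructing a version, which is the trade-off between compact storage and fast retrieval that the talk addresses.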

See more at microsoft.com/en-us/research/v...