Effective Data Versioning for Collaborative Data Science

Microsoft Research

Published March 25, 2019, 19:21

As more people in every organization and team do data science, dataset versions proliferate at every stage of data analysis. More often than not, these versions are stored ad hoc in shared file systems, which leads to massive redundancy and duplication and makes it impossible to keep track of and find specific versions. In the first part of this talk, I will present OrpheusDB, a tool we developed for effective versioning of structured data. OrpheusDB enables true collaboration via Git-style commands and supports reasoning about versions through a rich syntax of SQL-like statements. It can compactly store, keep track of, and retrieve versions on demand. In the second part of the talk, I will focus on one of our attempts at general-purpose data versioning, which removes assumptions made in OrpheusDB. Specifically, I will describe a generalized storage representation for arbitrary data formats that enables compact storage while maintaining fast version retrieval.
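To make the idea of compact storage with on-demand retrieval concrete, here is a minimal Python sketch. It is not OrpheusDB's implementation or API. It assumes a simple design in which each distinct record is stored once and every version is just a set of record ids. The names `VersionStore`, `commit`, `checkout`, and `stored_records` are hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, List, Optional, Tuple

# A record is an immutable row; tuples are hashable, so they can be deduplicated.
Record = Tuple[Hashable, ...]


class VersionStore:
    """Illustrative store: each distinct record is kept once, versions hold ids."""

    def __init__(self) -> None:
        self._rid_of: Dict[Record, int] = {}        # record -> record id
        self._records: List[Record] = []            # record id -> record
        self._versions: List[FrozenSet[int]] = []   # version id -> record ids
        self._parents: List[Optional[int]] = []     # version id -> parent version

    def commit(self, rows: Iterable[Record], parent: Optional[int] = None) -> int:
        """Register a new version and return its id; unchanged rows are reused."""
        rids = set()
        for row in rows:
            rid = self._rid_of.get(row)
            if rid is None:
                # First time this exact record is seen: store it once.
                rid = len(self._records)
                self._rid_of[row] = rid
                self._records.append(row)
            rids.add(rid)
        self._versions.append(frozenset(rids))
        self._parents.append(parent)
        return len(self._versions) - 1

    def checkout(self, vid: int) -> List[Record]:
        """Materialize a version, ordered by record id."""
        return [self._records[rid] for rid in sorted(self._versions[vid])]

    def stored_records(self) -> int:
        """Number of distinct records physically stored across all versions."""
        return len(self._records)


store = VersionStore()
v0 = store.commit([("alice", 30), ("bob", 25)])
# The second version changes one row and adds one; ("alice", 30) is shared.
v1 = store.commit([("alice", 30), ("bob", 26), ("carol", 41)], parent=v0)
print(store.checkout(v0))
print(store.checkout(v1))
print(store.stored_records())
```

The two versions hold five rows in total, but only four distinct records are stored, because the unchanged row is shared. A real system would also have to handle schema changes, ordering, and persistence, which this sketch leaves out.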

See more at microsoft.com/en-us/research/v...
