Effective Scientific Data Management through Provenance Collection

Published September 6, 2016, 16:28
Science has evolved over the past several decades from an empirical and theoretical approach to one that includes computational simulations and modeling. Scientific discoveries are increasingly propelled by loosely coupled, interdisciplinary groups collaborating across geographical boundaries through shared resources in a cyberinfrastructure. Experience from these science gateways is exposing the challenges posed by the immense quantities of distributed data that are processed, transformed, fused, and reused in complex data processing pipelines modeled as workflows. The ability to track the derivation history of data, known as data provenance, is a key facet of the end-to-end management of these scientific data products. Provenance metadata is used to discover data products based on how they were created and by whom, to monitor the usage and production of data by workflows, to replay a workflow in order to regenerate data from its original sources, and to provide context for reusing data published to the community. In this talk, I describe the Karma provenance framework, used for automated collection of data and process provenance from scientific workflows running in the Linked Environments for Atmospheric Discovery (LEAD) meteorology project. Karma is a standalone web service that uses activities published from instrumented services to track workflow orchestration and data derivation at runtime through a publish-subscribe system. Karma disseminates different views of provenance, such as data provenance, data usage trail, and workflow trace, and allows incremental queries to explore the provenance space. Beyond visually describing workflow execution, I am investigating the use of provenance to predict the quality of derived data products and to assist in workflow composition through reasoning systems, and I am exploring the role of provenance in the long-term preservation of scientific data.
I will also briefly discuss my work on cataloging geospatial metadata, data quality, and data replication and naming, done as part of the LEAD project.
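
To make the collection model described above more concrete, the following is a minimal sketch of how a subscriber might turn a stream of published activities into the three provenance views mentioned in the abstract (data provenance, data usage trail, and workflow trace). It is an illustration only, not Karma's actual API or schema: the `Activity` record, the `ProvenanceStore` class, the "consumed"/"produced" activity kinds, and the sample service and data names are all hypothetical, and the publish-subscribe transport is replaced by a plain method call.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """One notification published by an instrumented service (hypothetical shape)."""
    service: str   # service invocation that emitted the activity
    kind: str      # "consumed" or "produced" (assumed vocabulary)
    data_id: str   # identifier of the data product involved


class ProvenanceStore:
    """Toy subscriber that builds provenance views from published activities."""

    def __init__(self):
        self.inputs = defaultdict(list)   # service -> data products it consumed
        self.outputs = defaultdict(list)  # service -> data products it produced
        self.trace = []                   # workflow trace: activities in arrival order

    def on_activity(self, activity):
        """Callback a publish-subscribe broker would invoke for each activity."""
        self.trace.append(activity)
        target = self.inputs if activity.kind == "consumed" else self.outputs
        target[activity.service].append(activity.data_id)

    def data_provenance(self, data_id):
        """Walk backwards from data_id, returning (service, inputs) pairs."""
        history = []
        for service, produced in self.outputs.items():
            if data_id in produced:
                consumed = list(self.inputs.get(service, []))
                history.append((service, consumed))
                for parent in consumed:
                    history.extend(self.data_provenance(parent))
        return history

    def data_usage(self, data_id):
        """Return the services that consumed data_id (its usage trail)."""
        return [s for s, consumed in self.inputs.items() if data_id in consumed]


# Hypothetical two-step workflow: raw radar data is gridded, then used in a forecast.
store = ProvenanceStore()
for activity in [
    Activity("ingest", "consumed", "radar.raw"),
    Activity("ingest", "produced", "radar.grid"),
    Activity("forecast", "consumed", "radar.grid"),
    Activity("forecast", "produced", "forecast.nc"),
]:
    store.on_activity(activity)

print(store.data_provenance("forecast.nc"))
print(store.data_usage("radar.grid"))
```

In this sketch, querying the provenance of `forecast.nc` walks back through the `forecast` and `ingest` steps to `radar.raw`, while the usage query for `radar.grid` reports that `forecast` consumed it. Calling `data_provenance` on one of the returned inputs is a simple stand-in for the incremental style of query the abstract mentions; a real service would also need persistent storage, timestamps, and workflow identifiers, none of which are modeled here.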