Provenance Collection Support in the Kepler Scientific Workflow System

The focus of this paper is on efficiently rerunning workflows by using results from previous runs.

Kepler is a system for specifying and running workflows. A workflow consists of actors (transformations) connected by directed edges. Kepler provides a GUI with draggable elements that makes it easy to construct workflows by using actors from a library.
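To make the actor-and-edge model concrete, here is a minimal sketch in Python. It is not Kepler's API: the actor names, the `upstream` mapping, and the `run` helper are hypothetical, and each actor is assumed to be a plain function of its upstream outputs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical actors: each one transforms the outputs of its upstream actors.
actors = {
    "source": lambda: [3, 1, 2],
    "sort": lambda xs: sorted(xs),
    "total": lambda xs: sum(xs),
}

# Directed edges, written as actor -> list of upstream actors.
upstream = {"source": [], "sort": ["source"], "total": ["sort"]}

def run(actors, upstream):
    """Execute every actor once, always after the actors it depends on."""
    results = {}
    for name in TopologicalSorter(upstream).static_order():
        inputs = [results[u] for u in upstream[name]]
        results[name] = actors[name](*inputs)
    return results

results = run(actors, upstream)
```

The `results` dictionary holds the value each actor emitted, which is the kind of intermediate data the notes refer to as provenance.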

When a workflow runs, actors pass intermediate results to one another, and provenance in Kepler consists of these intermediate results. A workflow designer may evolve the workflow over time, for example by changing actor parameters, adding or removing actors, or changing how actors are connected. When a new version of the workflow is run, we would like to execute it efficiently by reusing the provenance from previous runs.
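One way to picture this reuse is a cache keyed on everything that determines an actor's output. The sketch below is an illustration under stated assumptions, not the paper's algorithm: the workflow is a simple linear pipeline, actors are deterministic, values are JSON-serializable, and `cache_key`, `run_pipeline`, `load`, and `scale` are hypothetical names.

```python
import hashlib
import json

def cache_key(name, params, inputs):
    """Hash the actor name, its parameters, and its input (assumed to determine the output)."""
    blob = json.dumps([name, params, inputs], sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def run_pipeline(steps, cache, log):
    """steps is a list of (name, function, params); each actor feeds the next."""
    value = None
    for name, fn, params in steps:
        key = cache_key(name, params, value)
        if key not in cache:
            cache[key] = fn(value, **params)
            log.append(name)  # record which actors actually executed
        value = cache[key]
    return value

def load(_, n):
    return list(range(n))

def scale(xs, factor):
    return [x * factor for x in xs]

cache, log = {}, []
first = run_pipeline(
    [("load", load, {"n": 3}), ("scale", scale, {"factor": 2})], cache, log
)
# Only the downstream parameter changes, so the stored output of "load" is reused.
second = run_pipeline(
    [("load", load, {"n": 3}), ("scale", scale, {"factor": 10})], cache, log
)
```

After both runs, `log` shows that `load` executed once and `scale` twice: changing a parameter invalidates only the changed actor and whatever depends on it.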

One subtlety is that some actors are non-cacheable. An example is an actor that downloads data from a remote database. It is non-cacheable because it doesn’t necessarily return the same result when rerun with the same parameters. When doing a “smart rerun”, we can either rerun a non-cacheable actor or reuse its stored result, depending on how important fresh data is.
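The freshness trade-off can be sketched as a small policy check. This is a hypothetical illustration, not Kepler's implementation: the `Actor` class, the `cacheable` flag, and the `need_fresh` switch are assumed names, and the cache is keyed by actor name only, ignoring parameters for brevity.

```python
import itertools

class Actor:
    """Hypothetical actor: a named function plus a flag saying whether its output may be cached."""
    def __init__(self, name, fn, cacheable=True):
        self.name = name
        self.fn = fn
        self.cacheable = cacheable

def smart_rerun(actor, cache, need_fresh):
    """Return the actor's output, reusing a stored result when the policy allows it."""
    has_stored = actor.name in cache
    # A non-cacheable actor is rerun only when the caller insists on fresh data.
    must_run = (not has_stored) or (not actor.cacheable and need_fresh)
    if must_run:
        cache[actor.name] = actor.fn()
    return cache[actor.name]

# Stand-in for a remote download: every call returns a different value.
counter = itertools.count(1)
download = Actor("download", lambda: next(counter), cacheable=False)

cache = {}
a = smart_rerun(download, cache, need_fresh=False)  # nothing stored yet, so it runs
b = smart_rerun(download, cache, need_fresh=False)  # stale data acceptable, stored result reused
c = smart_rerun(download, cache, need_fresh=True)   # freshness required, actor reruns
```

Here `a` and `b` are the same stored value, while `c` comes from a new execution of the download stand-in.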

 
panda/reading/kepler.txt · Last modified: 2009/12/01 11:21 by ragho
 