An Identity Crisis in the Life Sciences


This paper focuses on managing identities for data products that workflows collect and compute. Taverna is the authors’ system for designing and executing workflows.

An example of a workflow is given below:

Step 1) We call a BLAST service, which takes as input a DNA sequence (Seq0) to align against, and a set of parameter settings, which include the similarity threshold. The output is a BLAST report file.

Step 2) We parse the BLAST report file to recover a collection of DNA sequences ({Seq1, ..., SeqN}) that are similar to Seq0.

Step 3) For each sequence in the collection from Step 2, we retrieve a report from GenBank ({GBRpt1, ..., GBRptN}).
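
The three steps can be sketched as a small pipeline. This is only an illustration, not Taverna code: run_workflow, toy_blast, toy_parse, toy_genbank, TOY_HITS, and the "similarity_threshold" key are all hypothetical names, and the toy functions stand in for the real BLAST and GenBank services.

```python
from typing import Callable, Dict, List

Params = Dict[str, float]


def run_workflow(
    seq0: str,
    params: Params,
    blast: Callable[[str, Params], str],
    parse_report: Callable[[str], List[str]],
    fetch_genbank: Callable[[str], str],
) -> Dict[str, str]:
    """Run the three steps and map each similar sequence to its report."""
    report = blast(seq0, params)        # Step 1: BLAST report file
    similar = parse_report(report)      # Step 2: {Seq1, ..., SeqN}
    # Step 3: one GenBank report per similar sequence
    return {seq: fetch_genbank(seq) for seq in similar}


# Hypothetical toy data: candidate sequences and their similarity to Seq0.
TOY_HITS = {"ACGTAA": 0.95, "ACGTTT": 0.80, "TTTTTT": 0.10}


def toy_blast(seq0: str, params: Params) -> str:
    """Stand-in for a BLAST service: one 'sequence<TAB>score' line per hit."""
    threshold = params["similarity_threshold"]
    lines = [f"{s}\t{score}" for s, score in TOY_HITS.items() if score >= threshold]
    return "\n".join(lines)


def toy_parse(report: str) -> List[str]:
    """Stand-in for the report parser: keep the sequence column only."""
    return [line.split("\t")[0] for line in report.splitlines() if line]


def toy_genbank(seq: str) -> str:
    """Stand-in for a GenBank lookup."""
    return f"GBRpt for {seq}"


result = run_workflow(
    "ACGTAC", {"similarity_threshold": 0.5}, toy_blast, toy_parse, toy_genbank
)
```

Passing the services in as functions makes it easy to swap the BLAST service or change its parameters between runs.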

We may want to run the above workflow multiple times. Below are three reasons.

1) Updated data: The BLAST service may have updated data.

2) Different parameters: We may want to rerun the BLAST service with different parameters.

3) Different service: We may want to use a different BLAST service, which may implement a slightly different BLAST algorithm.

Given multiple runs of the same workflow, we want to compare the results, either by differencing provenance graphs (to see how the runs differ) or by merging them (to see, in a unified view, how a single data product was derived across runs). To compare provenance graphs, we need to identify the same data product across workflow runs by a common identifier. These common identifiers are generated using IDSets, which are constructed asynchronously with respect to workflow execution so that execution performance is not affected.
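
To make the comparison concrete, here is a minimal sketch of differencing and merging two provenance graphs once data products share identifiers across runs. It does not reproduce the paper's IDSet construction: as a labeled assumption, a content hash stands in for the common identifier, and common_id, to_graph, diff_graphs, and merge_graphs are hypothetical names. Each graph is modeled simply as a set of (input, output) derivation edges.

```python
import hashlib
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (identifier of input product, identifier of output product)


def common_id(content: str) -> str:
    """Assumed stand-in for an IDSet-derived identifier: a short content hash."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]


def to_graph(derivations: List[Tuple[str, str]]) -> Set[Edge]:
    """Turn (input content, output content) pairs into identifier edges."""
    return {(common_id(src), common_id(dst)) for src, dst in derivations}


def diff_graphs(first: Set[Edge], second: Set[Edge]) -> Dict[str, Set[Edge]]:
    """Split edges into those unique to each run and those shared by both."""
    return {
        "only_in_first": first - second,
        "only_in_second": second - first,
        "shared": first & second,
    }


def merge_graphs(first: Set[Edge], second: Set[Edge]) -> Set[Edge]:
    """Unified view: every derivation edge seen in either run."""
    return first | second


# Two runs that agree on Seq1 but otherwise return different hits.
run1 = to_graph([("Seq0", "Seq1"), ("Seq0", "Seq2")])
run2 = to_graph([("Seq0", "Seq1"), ("Seq0", "Seq3")])

delta = diff_graphs(run1, run2)
merged = merge_graphs(run1, run2)
```

Because identical products map to the same identifier in both runs, plain set operations are enough to compute the difference and the merged view in this simplified model.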

 
panda/reading/idcrisis.txt · Last modified: 2009/12/01 11:20 by ragho
 