Relevant Papers

Applications

  • One of the provenance challenge queries: differencing problem for a dataflow
  • Example: Scientific analysis involving protein annotation to infer biological function of new sequence from other sequences. Need to identify executions which lead to ‘good’ results. So, need to compare different executions to understand difference between runs.
  • Techniques in this paper support answering queries for more complex dataflows which include loops and forked (parallel) execution

Summary

  • Computing minimal edit distance between two arbitrarily shaped execution graphs is NP hard
  • Restrict shape to a specific type of graphs called series-parallel graphs - SP Graphs with well-formed nested forks and loops.
  • Show equivalence of SP Graphs with well-formed forks and loops to SP Trees
  • Provide algorithm to compute ‘unique’ SP Trees from each of the SP Graphs
  • Efficiently compute (in O(|E|3) time) minimal edit distance between two SP Trees and a corresponding edit script with the sequence of edit operations needed to change one tree to another.

Details

Workflow Specification
- directed graph similar to flow networks
-- nodes indicate processes and have unique ids
-- single source (start process) and single sink (final process)
-- directed edge indicates that in a given run, destination node process
   is executed only after source node process
-- loops - dotted backarrow from one node to another
-- fork - dotted oblong surrounding a subgraph - can fork multiple runs 
   of the same subgragh to be run in parallel
-- multiple outgoing edges - non-exclusive choice

Run - Execution Instance
- A non-deterministic recursive function called an 'execution function' generates individual runs from specifications
- In a run, loops are unrolled and the number of fork (parallel) executions 
  is given explicitly.
- Even if the specification has cycles, a valid 
  run is always acyclic, since we unfold the cycles in the 
  specification to capture the sequential order of all iterations in 
  a workflow run. Consequently, the node labels in a run R are 
  not necessarily unique.
- So annotate each node with an instance id to make labels unique
Comparing Runs
- find edit distance and a path edit script (consisting several edit operations)
  between two valid runs of the same specification.
- edit operations: path insertion, path deletion
- cost model: 
-- cost of edit script = sum of cost of individual operations.
-- cost of individual operation can be determined by application
--- can depend on both the length of the path being edited and the labels on either side of the path
--- must be a 'distance metric'

- similar to computing edit distance between ordered trees (polynomial), 
  unordered (NP hard), graphs (NP hard)
- restrict to series-parallel graphs - SP Graphs

- what are SP Graphs?
-- directed multigraph, single source, single sink
-- recursive definition:
--- basic SPG: one source, one sink
--- series composition: sink of one becomes source of another
--- parallel composition: sources of both merged, so are sinks
- more restrictions
-- SP Graph 'structured' with well-nested forks and loops
-- more details about what well-nested forks and loops are (not particularly interesting)
--- define 'laminar family': collection of subsets such that for any pair
    of subsets, one is a subset of the other, or they are disjoint
--- SP graph specification is (G, F), F is a laminar family of series subgraphs of G
    for allowed fork executions

- can construct SP Tree specification from SP Graph (G, F)
- another execution function to generate SP Tree run from SP Tree specification
- Provide algorithm to compute edit distance and edit script between two SP Trees
 
panda/reading/pdiffview.txt · Last modified: 2009/12/01 11:16 by ragho
 
Recent changes RSS feed Creative Commons License Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki