Link to paper: http://springerlink.metapress.com/content/r2e8g56tu3bjw8rl/fulltext.pdf

Environment:
Data warehousing setting
Warehouse data is derived from sources through DAG of transformations

Problem:
Given tuple t in warehouse, what set of input tuples comprise t's
lineage?

Assumptions:
- Each transformation is black box taking multiple input sets to
  multiple output sets.
- In general all intermediate results are stored (but techniques are
  given to reduce storage) but no lineage is recorded by
  transformations
- Properties of transformations may be declared that improve
  lineage-tracing, both more accurate lineage and efficiency.

Main contributions:
(1) Properties of individual transformations
(2) Tracing algorithms for each property
(3) Some theoretical results about properties and transformations
(4) Linear sequence of transformations: reduction of intermediate
    results by merging transformations for lineage-tracing,
    based on properties
(5) Handling multiple inputs and outputs
(6) (Minimal) performance experiments

(1),(2) Properties and Tracing Algorithms

T(I) = O
Without additional information, lineage of any o in O is I

(a) Dispatcher

  Definition: T(I) = Union T({i})
  Example: Split lists as strings into their components
  Tracing procedure: To find lineage of o, execute T({i}) for each i
    and keep ones that produce o

(b) Filter

  Definition: Dispatcher where T({i}) = {i} or emptyset
  Example: Any filter
  Tracing procedure: Lineage of o is o

(c) Aggregator

  Definition: Exists unique partition of input so T(Ik) = {ok} and
    covers entire output
  Example: Group-by aggregation
  Tracing procedure: Enumarate subsets of I, find one Ik such that
    T(Ik) = {o} and T(I-Ik) produces rest of O

(d) Context-free aggregator

  Definition: Essentially, partitioning depends on individual items
    and not what other items are in the input
  Example: Group-by aggregation (non-example: clustering)
  Tracing procedure: Create input partitions by seeing which
    combinations produce a single output (quadratic), then find the one
    producing o (linear)

(e) Key-preserving aggregator

  Definition: All subsets of each partition produce an output item
    with a key that's also in the actual output item for that partition
  Example: Group-by aggregation with grouping attribute (non-example:
    group-by aggregation without grouping attribute)
  Tracing procedure: Create input partitions with linear scan using
    key property, then pick the one producing o

(f) Forward schema mapping

  Definition: f(A) -> B
    Partition I based on f(A) values
    Partition O based on B values
    T(Ik) = Ok, Ik = all i such that f(i.A) = o.B
    T(Ik) = emptyset for rest
    Transformation can have multiple schema mappings
  Example: Transform R(A,B) to R(A+B)
  Tracing procedure: Narrow down lineage of o to I subset such
    that f(i.A) = o.B. Then apply tracing procedure for type of
    transformation.

(g) Backward schema mapping

  Definition: A <- g(B)
    Partition I based on A values
    Partition O based on g(B) values
    T(Ik) = Ok, Ik = all i such that i.A = g(o.B)
    T(Ik) = emptyset for rest
  Example: ???
  Tracing procedure: Narrow down lineage of o to I subset such
    that i.A = g(o.B). Then apply tracing procedure for type of
    transformation.

(h) Forward key-map

  Definition: Forward schema mapping with no T(Ik) = emptyset and
    B is key for O.
  Example: ???
  Tracing procedure: Lineage of o is all i such that f(i.A) = o.B

(i) Backward key-map

  Definition: Backward schema mapping where A is key for I
  Example: ???
  Tracing procedure: Lineage of o is unique I element with
    A.key = g(o.B)

(j) Backward total-map

  Definition: Backward schema mapping where A is all attributes of I
  Example: ???
  Tracing procedure: Lineage of o is g(o.B)

Also "tracing procedure" and "transformation inverse" options when
available.
Hierarchy of properties, tracing characteristics. Choose best one
for a transformation.

Other issues: works best with stable transformations (no spurious
output values), deterministic transformations

(4) Combining transformations

Properties of combined transformations based on individual properties

Mechanics of tracing combined transformations

Optimization problem: Combining reduces intermediate storage but may
increase computation and/or result in combined transformations with
less desirable properties

Greedy algorithm, performance measurements

(5) Multiple inputs/outputs

Exclusive multi-input transformation: Transforms each input separately
and unions the result. Treat as multiple transformations for lineage
tracing.

Inclusive multi-input transformation: Inputs combined together to
produce result. Also possible to treat as multiple transformations for
lineage tracing.

Multiple output: Same deal, treat as separate transformations

Generalization: Arbitrary DAGs decomposed into set of linear sequences
of transformations, trace through each one.