Link to paper: [[http://springerlink.metapress.com/content/r2e8g56tu3bjw8rl/fulltext.pdf]] Environment: Data warehousing setting Warehouse data is derived from sources through DAG of transformations Problem: Given tuple t in warehouse, what set of input tuples comprise t's lineage? Assumptions: - Each transformation is black box taking multiple input sets to multiple output sets. - In general all intermediate results are stored (but techniques are given to reduce storage) but no lineage is recorded by transformations - Properties of transformations may be declared that improve lineage-tracing, both more accurate lineage and efficiency. Main contributions: (1) Properties of individual transformations (2) Tracing algorithms for each property (3) Some theoretical results about properties and transformations (4) Linear sequence of transformations: reduction of intermediate results by merging transformations for lineage-tracing, based on properties (5) Handling multiple inputs and outputs (6) (Minimal) performance experiments (1),(2) Properties and Tracing Algorithms T(I) = O Without additional information, lineage of any o in O is I (a) Dispatcher Definition: T(I) = Union T({i}) Example: Split lists as strings into their components Tracing procedure: To find lineage of o, execute T({i}) for each i and keep ones that produce o (b) Filter Definition: Dispatcher where T({i}) = {i} or emptyset Example: Any filter Tracing procedure: Lineage of o is o (c) Aggregator Definition: Exists unique partition of input so T(Ik) = {ok} and covers entire output Example: Group-by aggregation Tracing procedure: Enumarate subsets of I, find one Ik such that T(Ik) = {o} and T(I-Ik) produces rest of O (d) Context-free aggregator Definition: Essentially, partitioning depends on individual items and not what other items are in the input Example: Group-by aggregation (non-example: clustering) Tracing procedure: Create input partitions by seeing which combinations produce a single output (quadratic), then find the one producing o (linear) (e) Key-preserving aggregator Definition: All subsets of each partition produce an output item with a key that's also in the actual output item for that partition Example: Group-by aggregation with grouping attribute (non-example: group-by aggregation without grouping attribute) Tracing procedure: Create input partitions with linear scan using key property, then pick the one producing o (f) Forward schema mapping Definition: f(A) -> B Partition I based on f(A) values Partition O based on B values T(Ik) = Ok, Ik = all i such that f(i.A) = o.B T(Ik) = emptyset for rest Transformation can have multiple schema mappings Example: Transform R(A,B) to R(A+B) Tracing procedure: Narrow down lineage of o to I subset such that f(i.A) = o.B. Then apply tracing procedure for type of transformation. (g) Backward schema mapping Definition: A <- g(B) Partition I based on A values Partition O based on g(B) values T(Ik) = Ok, Ik = all i such that i.A = g(o.B) T(Ik) = emptyset for rest Example: ??? Tracing procedure: Narrow down lineage of o to I subset such that i.A = g(o.B). Then apply tracing procedure for type of transformation. (h) Forward key-map Definition: Forward schema mapping with no T(Ik) = emptyset and B is key for O. Example: ??? Tracing procedure: Lineage of o is all i such that f(i.A) = o.B (i) Backward key-map Definition: Backward schema mapping where A is key for I Example: ??? Tracing procedure: Lineage of o is unique I element with A.key = g(o.B) (j) Backward total-map Definition: Backward schema mapping where A is all attributes of I Example: ??? Tracing procedure: Lineage of o is g(o.B) Also "tracing procedure" and "transformation inverse" options when available. Hierarchy of properties, tracing characteristics. Choose best one for a transformation. Other issues: works best with stable transformations (no spurious output values), deterministic transformations (4) Combining transformations Properties of combined transformations based on individual properties Mechanics of tracing combined transformations Optimization problem: Combining reduces intermediate storage but may increase computation and/or result in combined transformations with less desirable properties Greedy algorithm, performance measurements (5) Multiple inputs/outputs Exclusive multi-input transformation: Transforms each input separately and unions the result. Treat as multiple transformations for lineage tracing. Inclusive multi-input transformation: Inputs combined together to produce result. Also possible to treat as multiple transformations for lineage tracing. Multiple output: Same deal, treat as separate transformations Generalization: Arbitrary DAGs decomposed into set of linear sequences of transformations, trace through each one.