Link to paper: http://portal.acm.org/citation.cfm?id=357777

Environment:
Data warehousing setting
Set of source tables T1..Tn
Warehouse is relational view V over source data, V=Q(T1..Tn)

Problem:
Given tuple t in warehouse view V, what set of source tuples
comprise t's lineage?

Significant differences from Trio lineage:
(1) Query Q used to compute V is known.
(2) Lineage is sets of source tuples, not boolean combinations.
    (Can probably turn into boolean formulas based on Q.)
(3) Lineage is not computed when V is populated. Instead a "lineage
    tracing query" LQ is constructed to obtain t's lineage. LQ is
    based on Q and parameterized by t. LQ is mostly relational with
    a couple of easy special operators (e.g., Split).

Main contributions:
(1) Formal declarative definition of lineage independent of specific
    operators
(2) Lineage for specific operators consistent with formal definition,
    then (inductively) lineage for any relational view
(3) Theorems about correctness, transitivity, equivalent queries, etc.
(4) Method for generating lineage tracing query from view definition
    SPJ: One query
    ASPJ: One query
    Arbitrary aggregation and set operators: sequence of queries
    with intermediate results
(5) Bag semantics: "derivation sets" (all individual derivations) and
    "derivation pools" (all possibly contributing tuples)
(6) Issues related to warehousing environment, e.g., avoid source
    queries

Some details/examples for (1),(2),(4):

(1) Tuple Derivation for an Operator:

T = Op(T1..Tn)
Lineage of t in T is:
a) Maximal subsets T1*..Tn* of T1..Tn such that:
b) Op(T1*..Tn*) = {t}
c) For all t* in Ti*, Op(T1*..T(i-1)*,{t*},T(i+1)*..Tm*) != emptyset

Examples:

Select(A>5) from {5,6}. Lineage of 6 in result is {6}. (b) would allow
{5,6} but (c) makes sure it's not included.

Group-By(A) Sum(B) from {<1,2>,<1,3>,<1,0>}. Lineage of 5 in result is
all three tuples. (b) and (c) would allow <1,0> not to be included,
but (a) makes sure it's included.

(2) Definitions for most individual operators are unsurprising:

Selection, Projection, Join, Aggregation, Union

Difference: Lineage of t in T1-T2 is <{t},T2>. Consistent with
definition.

Definition for arbitrary Q is straightforward induction.

(4) Tracing Queries

SPJ View: Split (Select_c and A=t (T1 join ... join Tn))

ASPJ View: Split (Select_c and G=t.G (T1 join ... join Tn))

General Views: Arbitrary aggregation, union, difference
Split into "AUSPJ-segments" and "D-segments", need "intermediate
views" (materialized or computed on demand)

Example:

Project(B) Select (B>5) GroupBy(A) Sum(B) from {<1,2>,<1,3>,<2,4>,<2,6>}

Lineage of 10 in result needs intermediate result {<1,5>,<2,10>}

Note: Similar intermediate results needed for incremental view
maintenance.

Note: Additional intermediate results needed for tracing entirely at
warehouse.
 
panda/reading/cui1.txt · Last modified: 2009/12/01 11:22 by ragho
 
Recent changes RSS feed Creative Commons License Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki