Table of Contents

RELEVANT PAPERS

Efficient Provenance Storage (SIGMOD 2008) – Adriane Chapman, H.V. Jagadish, Prakash Ramanan

MOTIVATIONS

APPLICATIONS

In all of these systems, size of provenance can grow many times larger than base data (since provenance is very fine grained, # of operations of base data is large etc.)

DATA MODEL

EXAMPLE FROM THE PAPER


Here the edges(bottom up) represent the workflow. What the following example says is that some scientist first annotated PubMed article numbered 16524875 via CurateHPRD manipulation to give input data I. This data was mapped using schema mapping MHPRD to suit the scientist’s own schema. The output data was I’. From this data, information about molecule named ABC1, with ID 095477 was collected.

CONTRIBUTIONS

  1. Provenance Compression using the following two approaches:
    1. Provenance Factorization:
      • Basic idea is to factor out “common sub-expressions” in the provenance of different items.
      • Factorization is done at three different levels:
        1. factorization of identical provenance records. (Basic Factorization)
        2. factorization of identical provenance nodes. (Node Factorization)
        3. factorization of nodes that are identical except for their parameters.(Argument Factorization).
          E.g: Let the provenance records be P0 = A → B → C and P1 = A → X → C. Provenance after Basic Factorization will have two records for P0 and P1. However Node Factorization will have a single record A → (B or X) → C. Now let P0 = A → B → C (x1) and P1 = A → B → C (x2), where x1 and x2 are the parameters to manipulation C. Therefore Node Factorization will give 2 different records for P0 and P1. However, Argument Factorization will factor out the manipulations A → B → C and store pointers to them for P0 and P1.
    2. Provenance Inheritance
      • Basic idea is to find similarities in a local portion of the data tree (Structural Inheritance) or between provenance associated with data items of a particular type (Predicate Inheritance).
      • Structural Inheritance: Record provenance of an item if its provenance is different from its parent.
      • Predicate Inheritance: If every data item in a dataset contains the same provenance record, then the record can be moved from instance level provenance to dataset-level provenance.
  2. Combining Compression Techniques
    • For combining the Inheritance based strategies, structural inheritance must be applied before predicate inheritance to avoid potential ambiguity in reconstruction of provenance of data item.
    • Relative Ordering of Inheritance and Factorization doesn’t affect the correctness of compressed provenance, however, empirically it has been observed that applying Inheritance before Factorization causes faster compression.
  3. Identified classes of Queries that utilize provenance (e.g. retrieve provenance for specific data, for all items of type X, joins using provenance etc.)
  4. Combines well with traditional XML compression techniques as XGRIND to yield extremely small provenance stores.