Efficient Provenance Storage (SIGMOD 2008) – Adriane Chapman, H.V. Jagadish, Prakash Ramanan

In certain types of workflows, the size of provenance is an order of magnitude larger than the original data (e.g. structure of protein sequences using GRID technology, MiMI etc.)
Provenance should be queriable along with base data.
Incremental maintenance should be supported on provenance stores.

Integration of information about proteins from HPRD and BIND.
MiMI, an online protein interaction database.
Earth System Science Workbench (ESSW).

In all of these systems, size of provenance can grow many times larger than base data (since provenance is very fine grained, # of operations of base data is large etc.)

A basic logical data unit is a data item. These can be tuples or attributes in a relational database, elements or attributes in XML.
A data-set is comprised of a set of data items. Data-sets are manipulated via workflows which is modeled as a directed graph. Each node in the workflow represents a manipulation (e.g. : Select a subset of data based on some predicate, Transform input dataset I to I’ based on schema mapping M).
Provenance Record is the record of input and manipulations applied to the input.
Provenance node is a single manipulation (along with the input and parameters). Provenance Record is a tree of provenance nodes.
Instance Level Provenance: Provenance associated with a particular data item in the data set.
Dataset Level Provenance: Provenance record associated with a set of data items (data set).

Here the edges(bottom up) represent the workflow. What the following example says is that some scientist first annotated PubMed article numbered 16524875 via CurateHPRD manipulation to give input data I. This data was mapped using schema mapping M_HPRD to suit the scientist’s own schema. The output data was I’. From this data, information about molecule named ABC1, with ID 095477 was collected.

Provenance Compression using the following two approaches:
1. Provenance Factorization:
  - Basic idea is to factor out “common sub-expressions” in the provenance of different items.
  - Factorization is done at three different levels:
    1. factorization of identical provenance records. (Basic Factorization)
    2. factorization of identical provenance nodes. (Node Factorization)
    3. factorization of nodes that are identical except for their parameters.(Argument Factorization).
      E.g: Let the provenance records be P₀ = A → B → C and P₁ = A → X → C. Provenance after Basic Factorization will have two records for P₀ and P₁. However Node Factorization will have a single record A → (B or X) → C. Now let P₀ = A → B → C (x1) and P₁ = A → B → C (x2), where x1 and x2 are the parameters to manipulation C. Therefore Node Factorization will give 2 different records for P₀ and P₁. However, Argument Factorization will factor out the manipulations A → B → C and store pointers to them for P₀ and P₁.
2. Provenance Inheritance
  - Basic idea is to find similarities in a local portion of the data tree (Structural Inheritance) or between provenance associated with data items of a particular type (Predicate Inheritance).
  - Structural Inheritance: Record provenance of an item if its provenance is different from its parent.
  - Predicate Inheritance: If every data item in a dataset contains the same provenance record, then the record can be moved from instance level provenance to dataset-level provenance.
Combining Compression Techniques
- For combining the Inheritance based strategies, structural inheritance must be applied before predicate inheritance to avoid potential ambiguity in reconstruction of provenance of data item.
- Relative Ordering of Inheritance and Factorization doesn’t affect the correctness of compressed provenance, however, empirically it has been observed that applying Inheritance before Factorization causes faster compression.
Identified classes of Queries that utilize provenance (e.g. retrieve provenance for specific data, for all items of type X, joins using provenance etc.)
Combines well with traditional XML compression techniques as XGRIND to yield extremely small provenance stores.

RELEVANT PAPERS

MOTIVATIONS

APPLICATIONS

DATA MODEL

EXAMPLE FROM THE PAPER

CONTRIBUTIONS