Link to paper

Provenance Management in Curated Databases, SIGMOD 2006

Applications

High Level Overview

Recording User Actions

Represent the database as a tree

           DB/R/tid/F  (Database, Relation, Tuple ID, Field)

Record the actions of the following type:

           TID, Op, Loc, Src (Transaction ID, Operation, Location, Source)
           100, D, T/a, ⊥ (Delete node T/a)
           100, C, T/a, S/b  (Copy S/b to T/a)
           101, I, T/a/c, ⊥ (Insert c at T/a)

Compressions happen in two ways:

           Hierarchical: Copying an entire sub-tree <=> Copying all individual nodes in the sub-tree, thus recording of each individual node is not required
           Transactional: Skip intermediate steps in a transaction while recording in a log

Answering Queries

The log gives us an EDB: Prov(t, x, p, q)

Here are some useful IDBs:

           Unchanged: Unch(t, p) ← ¬(∃x,q. Prov(t, x, p, q)) 
           Insert:    Ins(t, p)  ← Prov(t, I, p, ⊥)
           Delete:    Del(t, p)  ← Prov(t, D, p, ⊥)
           Copy:      Copy(t, p, q) ← Prov(t, C, p, q)

Basically, these record whether “p” was changed, in a transaction “t”.

Additionally, we have Trace, which records where “p” came from.

           Trace(p, t, p, t).
           Trace(p, t, q, u)  ← Trace(p, t, r, s), Trace(r, s, q, u).
           Trace(p, t, q, t − 1) ← Copy(t, p, q)
           Trace(p, t, p, t − 1) ← Unch(t, p)

This is all we need to write fairly complex queries, such as

* Which transaction first created data at a location? (Note tnow is current transaction number)

           Src(p) = {u | ∃q.Trace(p, tnow , q, u), Ins(u, q)}

* What is the sequence of all transactions that copied a node to the current position?

           Hist(p) = {u | ∃q.Trace(p, tnow , q, u), Copy(u, q)}   

* Which transactions were responsible for the creation or modification of the subtree under a node? (Note ≤ is the prefix operator)

           Mod(p)  = {u | ∃q.p ≤ q, Trace(q, tnow , r, u), ¬Unch(u, r)}

Implementation

The authors have implemented a “CPDB” (Copy-Paste DB) which works as a wrapper around the traditional database and records user actions.