Link to paper

Applications

  • Scientific data management: “hand-curated” either by a single individual or by groups of individuals
  • Specifically, in molecular biology: Uniprot and the Nuclear Protein Database
  • Also geology and astronomy

High Level Overview

  • “Fine-grained/dataflow” provenance as against “Coarse-grained/Workflow” provenance tracking
  • Recording user actions (copy, insert, delete, update) in a “provenance log”, implemented as a wrapper
  • Two optimization techniques to reduce size of “provenance log”: Hierarchical, Transactional optimization
  • Answering queries using datalog (for example, which transaction first created data x, give all transactions that modified data y)

Recording User Actions

Represent the database as a tree

           DB/R/tid/F  (Database, Relation, Tuple ID, Field)

Record the actions of the following type:

           TID, Op, Loc, Src (Transaction ID, Operation, Location, Source)
           100, D, T/a, ⊥ (Delete node T/a)
           100, C, T/a, S/b  (Copy S/b to T/a)
           101, I, T/a/c, ⊥ (Insert c at T/a)

Compressions happen in two ways:

           Hierarchical: Copying an entire sub-tree <=> Copying all individual nodes in the sub-tree, thus recording of each individual node is not required
           Transactional: Skip intermediate steps in a transaction while recording in a log

Answering Queries

The log gives us an EDB: Prov(t, x, p, q)

Here are some useful IDBs:

           Unchanged: Unch(t, p) ← ¬(∃x,q. Prov(t, x, p, q)) 
           Insert:    Ins(t, p)  ← Prov(t, I, p, ⊥)
           Delete:    Del(t, p)  ← Prov(t, D, p, ⊥)
           Copy:      Copy(t, p, q) ← Prov(t, C, p, q)

Basically, these record whether “p” was changed, in a transaction “t”.

Additionally, we have Trace, which records where “p” came from.

           Trace(p, t, p, t).
           Trace(p, t, q, u)  ← Trace(p, t, r, s), Trace(r, s, q, u).
           Trace(p, t, q, t − 1) ← Copy(t, p, q)
           Trace(p, t, p, t − 1) ← Unch(t, p)

This is all we need to write fairly complex queries, such as

* Which transaction first created data at a location? (Note tnow is current transaction number)

           Src(p) = {u | ∃q.Trace(p, tnow , q, u), Ins(u, q)}

* What is the sequence of all transactions that copied a node to the current position?

           Hist(p) = {u | ∃q.Trace(p, tnow , q, u), Copy(u, q)}   

* Which transactions were responsible for the creation or modification of the subtree under a node? (Note ≤ is the prefix operator)

           Mod(p)  = {u | ∃q.p ≤ q, Trace(q, tnow , r, u), ¬Unch(u, r)}

Implementation

The authors have implemented a “CPDB” (Copy-Paste DB) which works as a wrapper around the traditional database and records user actions.

 
panda/reading/curated.txt · Last modified: 2009/12/01 11:13 by ragho
 
Recent changes RSS feed Creative Commons License Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki