=== Link to paper === [[http://homepages.inf.ed.ac.uk/opb/papers/sigmod2006.pdf|Provenance Management in Curated Databases]], SIGMOD 2006 === Applications === * Scientific data management: "hand-curated" either by a single individual or by groups of individuals * Specifically, in molecular biology: Uniprot and the Nuclear Protein Database * Also geology and astronomy === High Level Overview === * "Fine-grained/dataflow" provenance as against "Coarse-grained/Workflow" provenance tracking * Recording user actions (copy, insert, delete, update) in a "provenance log", implemented as a wrapper * Two optimization techniques to reduce size of "provenance log": Hierarchical, Transactional optimization * Answering queries using datalog (for example, which transaction first created data x, give all transactions that modified data y) === Recording User Actions === Represent the database as a tree DB/R/tid/F (Database, Relation, Tuple ID, Field) Record the actions of the following type: TID, Op, Loc, Src (Transaction ID, Operation, Location, Source) 100, D, T/a, ⊥ (Delete node T/a) 100, C, T/a, S/b (Copy S/b to T/a) 101, I, T/a/c, ⊥ (Insert c at T/a) Compressions happen in two ways: Hierarchical: Copying an entire sub-tree <=> Copying all individual nodes in the sub-tree, thus recording of each individual node is not required Transactional: Skip intermediate steps in a transaction while recording in a log === Answering Queries === The log gives us an EDB: Prov(t, x, p, q) Here are some useful IDBs: Unchanged: Unch(t, p) ← ¬(∃x,q. Prov(t, x, p, q)) Insert: Ins(t, p) ← Prov(t, I, p, ⊥) Delete: Del(t, p) ← Prov(t, D, p, ⊥) Copy: Copy(t, p, q) ← Prov(t, C, p, q) Basically, these record whether "p" was changed, in a transaction "t". Additionally, we have Trace, which records where "p" came from. Trace(p, t, p, t). Trace(p, t, q, u) ← Trace(p, t, r, s), Trace(r, s, q, u). Trace(p, t, q, t − 1) ← Copy(t, p, q) Trace(p, t, p, t − 1) ← Unch(t, p) This is all we need to write fairly complex queries, such as * Which transaction first created data at a location? (Note tnow is current transaction number) Src(p) = {u | ∃q.Trace(p, tnow , q, u), Ins(u, q)} * What is the sequence of all transactions that copied a node to the current position? Hist(p) = {u | ∃q.Trace(p, tnow , q, u), Copy(u, q)} * Which transactions were responsible for the creation or modification of the subtree under a node? (Note ≤ is the prefix operator) Mod(p) = {u | ∃q.p ≤ q, Trace(q, tnow , r, u), ¬Unch(u, r)} === Implementation === The authors have implemented a "CPDB" (Copy-Paste DB) which works as a wrapper around the traditional database and records user actions.