Link to paper: http://www.eecs.harvard.edu/~syrah/pass/pubs/usenix06.pdf
Summary:
1.) A General Taxonomy of Provenance (in 2006...)
2.) Motivation for capturing system-level provenance
3.) PASS: the vision and the implementation
4.) Applications of system-granularity provenance
5.) Future challenges of provenance
Detail:
1.) A General Taxonomy of Provenance (people want provenance for different reasons; your goals affect your implementations)
File, FS, DB approaches
Service-Oriented approaches
Scripting architectures
e.g. build systems maintain dependencies; source code control (e.g. CVS, SVN); also requires an a priori description of provenance derivation
Environment architectures
2.) System-level Provenance
Arguments for collecting provenance at the system level, automatically:
file granularity (although this is not required; could do byte-level collection), which is fairly detailed
“tight coupling” of data and provenance
* provenance usually stored in a standalone DB; pain to maintain, back up
* completeness; system sees everything
collection transparent to users
provenance-aware systems on higher levels can sit on top of the OS
Applications/Use Cases for System-Level Generated Provenance
script generation
* people who run workflows over and over again can more easily automate them by looking at the provenance
detecting system changes (e.g. environment variables ⇒ tracks changes in tools, environment, libraries)
* detecting intrusion (this has been a commonly envisioned application of provenance; in practice there are several difficult questions here, e.g. how do you protect the provenance?)
retrieving compile-time flags (something we all forget from time to time; PASS automatically collects command-line arguments)
build debugging (easy to spot incomplete/missing dependencies if you know what’s supposed to be there)
understanding (visualizing) system dependencies
PASS Itself
Vision:
* PASS should act as the “base-level” system provenance collector; provenance-aware applications should be able to use the storage system to support provenance collection and querying
* PASS should support application-layer provenance (i.e. let applications write to PASS’s DB instead of keeping a separate one around)
* seamless
* provenance security (data and provenance are not equally sensitive)
* support queries on provenance (more about this in the next paper)
What it is:
* collects system-level provenance for all “objects” (in this case, files), stored as an “ancestry” (loose) graph
* edges are processes, nodes are files
* provenance includes several system-level variables, which are collected and stored with object provenance
* defn: a file is either new, or the output of some process
* cannot collect “opaque” provenance, but allows annotations, externally-generated provenance (e.g. from GenePattern) etc.
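The ancestry graph described above can be pictured with a minimal Python sketch. This is an illustration under assumptions, not PASS's actual data structures: the class `AncestryGraph`, its methods, and the file and process names are all hypothetical. Files are nodes, each edge is labelled with the process that read one file and wrote another, and a file with no recorded parents counts as "new".

```python
from collections import defaultdict


class AncestryGraph:
    """Toy ancestry graph: files are nodes, processes label the edges."""

    def __init__(self):
        # output file -> set of (input file, process name) pairs
        self._parents = defaultdict(set)

    def record(self, process, inputs, output):
        """Record that `process` read `inputs` and produced `output`."""
        for src in inputs:
            self._parents[output].add((src, process))

    def ancestors(self, path):
        """Return every file that `path` transitively derives from."""
        seen = set()
        stack = [path]
        while stack:
            current = stack.pop()
            for src, _process in self._parents.get(current, ()):
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen


# Hypothetical build: compile, then link.
graph = AncestryGraph()
graph.record("cc", ["main.c", "util.h"], "main.o")
graph.record("ld", ["main.o", "libc.a"], "a.out")
print(sorted(graph.ancestors("a.out")))
# prints ['libc.a', 'main.c', 'main.o', 'util.h']
```

Asking for the ancestors of `main.c` returns an empty set, which matches the definition in the notes: a file is either new or the output of some process.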
How:
2 representations of provenance: one in memory and one on disk
* disk: cross-references of file ancestry
* memory: current environment of processes, variables, open files etc. anything that might affect the provenance of some object that becomes persistent
* everything is tracked in memory because we never know when something is going to be relevant to file provenance. If nothing is ever written from a process, then we don’t care; if environment variables have no effect on any files, we don’t care. But if they do, we need to be able to remember them.
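The memory/disk split can be sketched roughly as follows. This is a simplified illustration, not PASS's implementation: `ProcessState`, `on_disk`, `in_memory`, and the handler names are hypothetical, and a plain dict stands in for the on-disk store. Process state accumulates in memory and is persisted only once the process writes a file.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessState:
    """In-memory record of what might later matter for a file's provenance."""
    pid: int
    argv: list
    env: dict
    inputs: set = field(default_factory=set)


on_disk = {}    # stand-in for the persistent store: file path -> provenance
in_memory = {}  # pid -> ProcessState


def on_exec(pid, argv, env):
    # Remember everything up front; we do not yet know if it will matter.
    in_memory[pid] = ProcessState(pid, list(argv), dict(env))


def on_read(pid, path):
    in_memory[pid].inputs.add(path)


def on_write(pid, path):
    # The process has affected a persistent object, so its state now matters.
    state = in_memory[pid]
    on_disk[path] = {
        "argv": state.argv,
        "env": state.env,
        "inputs": sorted(state.inputs),
    }


def on_exit(pid):
    # A process that never wrote anything leaves no trace on disk.
    in_memory.pop(pid, None)


# Process 1 reads and writes, so its environment is persisted.
on_exec(1, ["sort", "in.txt"], {"LC_ALL": "C"})
on_read(1, "in.txt")
on_write(1, "out.txt")
on_exit(1)

# Process 2 only reads, so nothing about it reaches the store.
on_exec(2, ["cat", "in.txt"], {})
on_read(2, "in.txt")
on_exit(2)

print(on_disk)
```

After both processes exit, `on_disk` holds a single entry for `out.txt`, and nothing about the read-only process was persisted.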
Collection (overview; it’s actually more complicated than this). Goal: intercept system calls, translate them into provenance, and maintain the provenance graph in memory
* look at system calls, and the data collected for them. For example:
* * execve (execute a program): this starts a process, for which we collect:
⁃ environment
⁃ command line arguments
⁃ process name
⁃ process ID
⁃ kernel version
⁃ kernel modules loaded
⁃ reference to program (input)
⁃ all attached to the current PROCESS
⁃ file path name
⁃ saved for attachment to the target FILE
⁃ reference to file
⁃ attached to the current PROCESS as INPUT
⁃ reference to the current process
⁃ saved in the target FILE'S INODE
* suppresses duplicates of identical ancestors
* * e.g. if a file is large, there may be several reads of the same file; don’t want to create duplicate records
⁃ never throw away any provenance; instead we “version” files when they change
⁃ achieve this by intercepting TRUNCATE operations
⁃ when to version? versioning on every write gives a provenance explosion --> the simple case versions on the last CLOSE and on every SYNC; the complicated case uses cycle-breaking algorithms.
⁃ cycle breaking
⁃ cycles in ancestry data are nonsensical
⁃ stored in an in-kernel DB
⁃ not entirely relevant because
⁃ a.) it depends on what you want to get out of your implementation
⁃ b.) it depends on how you want to query
⁃ c.) we changed our data model
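The duplicate suppression, versioning, and cycle concerns from the collection notes above can be sketched together in a few lines of Python. This is not the cycle-breaking algorithm from the paper; it is a toy model under assumptions, showing why versioning on CLOSE keeps a simple ancestry graph acyclic and how a cycle could be detected. The class `Collector`, its methods, and the file names `a` and `b` are hypothetical.

```python
class Collector:
    """Toy collector: dedups ancestry records and versions files on close."""

    def __init__(self):
        self.version = {}   # path -> current version number
        self.frozen = set() # paths closed since their last write
        self.edges = set()  # ((path, version), (path, version)) ancestry

    def _current(self, path):
        return (path, self.version.setdefault(path, 0))

    def flow(self, src, dst):
        """Record that data read from `src` was written into `dst`."""
        source = self._current(src)
        if dst in self.frozen:
            # Writing to a closed file starts a new version of it,
            # which descends from the old version.
            old = self._current(dst)
            self.version[dst] = old[1] + 1
            self.frozen.discard(dst)
            self.edges.add((old, self._current(dst)))
        # A set silently drops identical repeated records.
        self.edges.add((source, self._current(dst)))

    def close(self, path):
        """Freeze the current version; the next write creates a new one."""
        self.frozen.add(path)

    def has_cycle(self):
        """Depth-first search for a cycle among versioned nodes."""
        children = {}
        for src, dst in self.edges:
            children.setdefault(src, set()).add(dst)
        done, active = set(), set()

        def visit(node):
            if node in active:
                return True
            if node in done:
                return False
            active.add(node)
            found = any(visit(nxt) for nxt in children.get(node, ()))
            active.discard(node)
            done.add(node)
            return found

        return any(visit(node) for node in list(children))


# With versioning: a feeds b, both are closed, then b feeds a new version of a.
versioned = Collector()
for _ in range(3):            # three reads of a large file, one record
    versioned.flow("a", "b")
versioned.close("a")
versioned.close("b")
versioned.flow("b", "a")

# Without any close: a feeds b and b feeds a within the same versions.
unversioned = Collector()
unversioned.flow("a", "b")
unversioned.flow("b", "a")

print(versioned.has_cycle(), unversioned.has_cycle())
# prints False True
```

In the versioned run the second write lands in version 1 of `a`, so the graph stays acyclic; in the unversioned run the same two flows form a loop between `a` and `b`. What to do once a cycle is found is deliberately left out, since the notes do not describe the algorithm.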
Future Challenges
several model-rethinks (storage, querying, cycle-breaking...)
security
interoperability is the big problem (for provenance in general)