===== Provenance-Aware Storage Systems (2006) =====

Link to paper: [[http://www.eecs.harvard.edu/~syrah/pass/pubs/usenix06.pdf]]

Summary:
--------
1.) A General Taxonomy of Provenance (in 2006...)
2.) Motivation for capturing system-level provenance
3.) PASS: the vision and the implementation
4.) Applications of system-granularity provenance
5.) Future challenges of provenance

Detail:
-------
1.) A General Taxonomy of Provenance (people want provenance for different reasons; your goals affect your implementation)

File, FS, DB approaches
  * this includes PASS (and maybe Trio)
  * these systems capture everything at some level of abstraction; derivation rules are inferred in real time (i.e. systems simply capture everything)

Service-Oriented approaches
  * application-level; the objects and actions for which provenance is collected are finite and pre-defined (a priori)

Scripting architectures
  * e.g. build systems that maintain dependencies; source code control (e.g. CVS, SVN); these also require an a priori description of provenance derivation

Environment architectures
  * also application-level; tracks everything (doesn't require a priori specifications of what's ok) but can't capture provenance outside of the application
  * Example: several scientific

2.) System-level Provenance

Arguments for collecting provenance at the system level, automatically:
  * file granularity, which is fairly detailed (although this is not required; could do byte-level collection)
  * "tight coupling" of data and provenance
    * provenance is usually stored in a standalone DB, which is a pain to maintain and back up
    * completeness; the system sees everything
  * collection is transparent to users
  * provenance-aware systems on higher levels can sit on top of the OS

Applications/Use Cases for System-Level Generated Provenance
  * script generation
    * people who run workflows over and over again can more easily automate them by looking at the provenance
  * detecting system changes (e.g. environment variables => tracks changes in tools, environment, libraries)
    * detecting intrusion (this has been a commonly envisioned application of provenance; in practice there are several difficult questions here: how do you protect the provenance?)
  * retrieving compile-time flags (something we all forget from time to time; PASS automatically collects command-line arguments)
  * build debugging (easy to spot incomplete/missing dependencies if you know what's supposed to be there)
  * understanding (visualizing) system dependencies

PASS Itself
  * Vision:
    * PASS should act as the "base-level" system provenance collector; provenance-aware applications should be able to use the storage system to support provenance collection and querying
    * PASS should support application-layer provenance (i.e. let applications write to PASS's DB instead of keeping a separate one around)
    * seamless
    * provenance security (data and provenance are not equally sensitive)
    * support queries on provenance (more about this in the next paper)
  * What it is:
    * collects system-level provenance for all "objects" (in this case, files), stored as an "ancestry" (loose) graph
    * edges are processes, nodes are files
    * provenance includes several system-level variables, which are collected and stored with object provenance
    * defn: a file is either new, or the output of some process
    * cannot collect "opaque" provenance, but allows annotations, externally-generated provenance (e.g. from GenePattern) etc.

How:
  * 2 representations of provenance: one in memory and one on disk
    * disk: cross-references of file ancestry
    * memory: current environment of processes, variables, open files etc.; anything that might affect the provenance of some object that becomes persistent
    * everything is tracked in memory because we never know when something is going to be relevant to file provenance. If nothing is ever written from a process, then we don't care; if environment variables have no effect on any files, we don't care. But if they do, we need to be able to remember them.
  * Collection (overview; it's actually more complicated than this). Goal: intercept system calls, translate them into provenance, and maintain the provenance graph in memory
    * look at system calls, and the data collected for them. For example:
      * execve (execute a program): this starts a process, for which we collect the following, all attached to the current PROCESS
        * environment
        * command line arguments
        * process name
        * process ID
        * kernel version
        * kernel modules loaded
        * reference to program (input)
      * open (open a file): this suggests we're going to do either a read or a write, and the file we're opening will be somehow involved in the ancestry
        * file path name
        * saved for attachment to the target FILE
      * read (read a file): as above
        * reference to file
        * attached to the current PROCESS as INPUT
      * write (write a file)
        * reference to the current process
        * saved in the target FILE'S INODE
    * suppresses duplicates of identical ancestors
      * e.g. if a file is large, there may be several reads of the same file; we don't want to create duplicate records
    * versioning and cycle detection
      * never throw away any provenance; instead we "version" files when they change
      * achieve this by intercepting TRUNCATE operations
      * when to version? versioning on every write gives a provenance explosion --> the simple case versions on the last CLOSE and every SYNC; the complicated case uses cycle-breaking algorithms
      * cycle breaking
        * cycles in ancestry data are nonsensical
  * Storage
    * stored in an in-kernel DB
    * not entirely relevant because
      * a.) it depends on what you want to get out of your implementation
      * b.) it depends on how you want to query
      * c.) we changed our data model

Future Challenges
  * several model-rethinks (storage, querying, cycle-breaking...)
  * security
  * interoperability is the big problem (for provenance in general)
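
Sketch (not from the paper): a rough Python illustration of the collection rules under "How" above, i.e. execve/read attach inputs to the current process, write attaches the process to the target file, duplicate ancestors are dropped, and a file is versioned rather than allowed to form a cycle. Everything here is an assumption made for illustration: the ''Tracker'' and ''Node'' names, the "path@version" naming, and the reachability check that stands in for PASS's actual cycle-breaking algorithms. Processes are also kept as graph nodes for simplicity, whereas the notes describe them as edges between files.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A process, or one version of a file, in the in-memory ancestry graph."""
    name: str
    attrs: dict = field(default_factory=dict)
    ancestors: set = field(default_factory=set)  # a set drops duplicate ancestors


class Tracker:
    """Hypothetical in-memory tracker; not the PASS data model."""

    def __init__(self):
        self.nodes = {}     # node name -> Node
        self.versions = {}  # file path -> current version number

    def _node(self, name):
        return self.nodes.setdefault(name, Node(name))

    def _file(self, path):
        return self._node(f"{path}@{self.versions.setdefault(path, 0)}")

    def _proc(self, pid):
        return self._node(f"proc:{pid}")

    def _reaches(self, start, target):
        """True if `target` is `start` or one of its transitive ancestors."""
        stack, seen = [start], set()
        while stack:
            name = stack.pop()
            if name == target:
                return True
            if name not in seen:
                seen.add(name)
                stack.extend(self.nodes[name].ancestors)
        return False

    def execve(self, pid, program, argv, env):
        # Environment and arguments are attached to the current process.
        proc = self._proc(pid)
        proc.attrs.update(argv=list(argv), env=dict(env))
        proc.ancestors.add(self._file(program).name)  # the program is an input

    def read(self, pid, path):
        # The file becomes an input of the process; repeated reads add nothing.
        self._proc(pid).ancestors.add(self._file(path).name)

    def write(self, pid, path):
        proc = self._proc(pid)
        if self._reaches(proc.name, self._file(path).name):
            # Writing here would make the file its own ancestor,
            # so the write goes to a fresh version instead.
            self.versions[path] += 1
        self._file(path).ancestors.add(proc.name)


t = Tracker()
t.execve(1, "/bin/sort", ["sort", "-o", "data.txt", "data.txt"], {"LANG": "C"})
t.read(1, "data.txt")
t.read(1, "data.txt")   # second read of the same file: no duplicate record
t.write(1, "data.txt")  # same file is input and output: a new version is created
print(sorted(t.nodes["proc:1"].ancestors))
print(t.nodes["data.txt@1"].ancestors)
```

In this example the process reads and then overwrites the same file, so the output lands in ''data.txt@1'' while the process keeps ''data.txt@0'' as its input, and the ancestry stays acyclic. Versioning on the last CLOSE and every SYNC, as in the simple case above, is not modelled.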