Provenance-Aware Storage Systems (2006)

Link to paper: http://www.eecs.harvard.edu/~syrah/pass/pubs/usenix06.pdf

Summary:


1.) A General Taxonomy of Provenance (in 2006...)

2.) Motivation for capturing system-level provenance

3.) PASS: the vision and the implementation

4.) Applications of system-granularity provenance

5.) Future challenges of provenance

Detail:


1.) A General Taxonomy of Provenance (people want provenance for different reasons; your goals affect your implementations)

File, FS, DB approaches

  • this includes PASS (and maybe Trio)
  • these systems capture everything on some level of abstraction; derivation rules are inferred in real-time (i.e. systems simply capture everything)

Service-Oriented approaches

  • application-level; objects and actions for which provenance is collected are finite and pre-defined (a priori)

Scripting architectures

  • e.g. build systems maintain dependencies; source code control (e.g. CVS, SVN); also requires an a priori description of provenance derivation

Environment architectures

  • also application-level; tracks everything (doesn’t require a priori specifications of what’s ok) but can’t capture provenance outside of application
  • Example: several scientific

2.) System-level Provenance

Arguments for collecting provenance at the system level, automatically:

  • file granularity (although this is not required; could do byte-level collection), which is fairly detailed
  • “tight coupling” of data and provenance
  • * provenance usually stored in a standalone DB; pain to maintain, back up
  • * completeness; system sees everything
  • collection transparent to users
  • provenance-aware systems on higher levels can sit on top of the OS

Applications/Use Cases for System-Level Generated Provenance

  • script generation
  • * people who run workflows over and over again can more easily automate them from looking at the provenance
  • detecting system changes (e.g. environment variables ⇒ tracks changes in tools, environment, libraries)
  • * detecting intrusion (this has been a common envisioned application of provenance; in practice there are several difficult questions here: how do you protect the provenance?)
  • retrieving compile-time flags (something we all forget from time to time; PASS automatically collects command-line arguments)
  • build debugging (easy to spot incomplete/missing dependencies if you know what’s supposed to be there)
  • understanding (visualizing) system dependencies
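
A minimal sketch of the script-generation use case, assuming provenance is available as a plain dictionary mapping each output file to the command line that produced it and the files that command read. The record layout, the file names, and the function name regenerate_script are all hypothetical; PASS itself stores provenance differently.

```python
import shlex

# Hypothetical provenance records: output file -> the argv of the process
# that wrote it and the files that process read.
PROVENANCE = {
    "report.pdf": {"argv": ["pdflatex", "report.tex"],
                   "inputs": ["report.tex", "plot.png"]},
    "plot.png": {"argv": ["python", "plot.py", "--out", "plot.png"],
                 "inputs": ["plot.py", "data.csv"]},
}


def regenerate_script(target, provenance):
    """Return shell command lines that rebuild `target`, ancestors first."""
    lines, seen = [], set()

    def visit(name):
        if name in seen or name not in provenance:
            return  # already emitted, or a source file with no recorded producer
        seen.add(name)
        record = provenance[name]
        for parent in record["inputs"]:
            visit(parent)  # emit the commands for inputs before this one
        lines.append(shlex.join(record["argv"]))

    visit(target)
    return lines


if __name__ == "__main__":
    print("\n".join(regenerate_script("report.pdf", PROVENANCE)))
```

Walking inputs before emitting a command gives a dependency-ordered script; files with no recorded producer are treated as hand-written sources.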

PASS Itself

  • Vision:
  • * PASS should act as the “base-level” system provenance collector; provenance-aware applications should be able to use the storage system to support provenance collection and querying
  • * PASS should support application-layer provenance (i.e. let applications write to PASS’s DB instead of keeping a separate one around)
  • * seamless
  • * provenance security (data and provenance are not equally sensitive)
  • * support queries on provenance (more about this in the next paper)
  • What it is:
  • * collects system-level provenance for all “objects” (in this case, files), stored as an “ancestry” (loose) graph
  • * edges are processes, nodes are files
  • * provenance includes several system-level variables, which are collected and stored with object provenance
  • * defn: a file is either new, or the output of some process
  • * cannot collect “opaque” provenance, but allows annotations, externally-generated provenance (e.g. from GenePattern) etc.
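
The ancestry graph described above can be sketched as follows, assuming files are identified by name and each edge is labeled with the process that read the parent and wrote the child. The class AncestryGraph and its methods are hypothetical names for illustration, not PASS's actual data structures.

```python
from collections import defaultdict


class AncestryGraph:
    """Files are nodes; each edge carries the process that derived child from parent."""

    def __init__(self):
        # child file -> list of (process name, parent file) edges
        self.parents = defaultdict(list)

    def record(self, output, process, inputs):
        """Note that `process` read `inputs` and wrote `output`."""
        for parent in inputs:
            edge = (process, parent)
            if edge not in self.parents[output]:
                self.parents[output].append(edge)

    def ancestors(self, name):
        """Return every file that `name` transitively derives from."""
        found, stack = set(), [name]
        while stack:
            current = stack.pop()
            for _process, parent in self.parents.get(current, []):
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found


if __name__ == "__main__":
    graph = AncestryGraph()
    graph.record("a.o", "cc", ["a.c", "a.h"])
    graph.record("prog", "ld", ["a.o"])
    print(sorted(graph.ancestors("prog")))
```

A file with no recorded parents corresponds to the "new" case in the definition above; everything else is the output of some process.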

How:

  • 2 representations of provenance: one in memory and one on disk
  • * disk: cross-references of file ancestry
  • * memory: current environment of processes, variables, open files, etc.: anything that might affect the provenance of some object that becomes persistent
  • * everything is tracked in memory because we never know when something is going to be relevant to file provenance. If nothing is ever written from a process, then we don’t care; if environment variables have no effect on any files, we don’t care. But if they do, we need to be able to remember them.
  • Collection (overview; it’s actually more complicated than this). Goal: intercept system calls, translate them into provenance, and maintain the provenance graph in memory
  • * look at system calls, and the data collected for them. For example:
  • * * execve (execute a program) this starts a process, for which we collect:
⁃	environment
⁃	command line arguments
⁃	process name
⁃	process ID
⁃	kernel version
⁃	kernel modules loaded
⁃	reference to program (input)
⁃	all attached to the current PROCESS
  • * * open (open a file) this suggests we’re going to do either a read or a write, and the file we’re opening will be somehow involved in the ancestry
⁃	file path name
⁃	saved for attachment to the target FILE
  • * * read (read a file) as above
⁃	reference to file
⁃	attached to the current PROCESS as INPUT
  • * * write (write a file)
⁃	reference to the current process
⁃	saved in the target FILE'S INODE
  • * suppresses duplicates of identical ancestors
  • * * e.g. if a file is large, there may be several reads of the same file; don’t want to create duplicate records
  • * versioning and cycle detection
⁃	never throw away any provenance; instead we “version” files when they change
⁃	achieve this by intercepting TRUNCATE operations
⁃	when to version? Versioning on every write causes a provenance explosion, so the simple case versions on the last CLOSE and on every SYNC; the complicated case uses cycle-breaking algorithms.
⁃	cycle breaking: cycles in ancestry data are nonsensical (a file cannot be its own ancestor), so they must be detected and broken
  • Storage
⁃	stored in an in-kernel DB
⁃	the storage details are not entirely relevant because
⁃	a.) it depends on what you want to get out of your implementation
⁃	b.) it depends on how you want to query
⁃	c.) we changed our data model
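
The collection steps in this section can be sketched in user space as follows. This is a simplified illustration under stated assumptions, not the in-kernel implementation: the class Collector and its method names are hypothetical, system calls arrive as plain method calls, a new version is created on every close by a writer (SYNC is ignored), and cycle breaking is left out entirely.

```python
class Collector:
    """Toy model: per-process state lives in memory; records are made on close."""

    def __init__(self):
        self.processes = {}  # pid -> argv, env, and inputs (in-memory only)
        self.pending = {}    # path -> pids that wrote since the last version
        self.versions = {}   # path -> current version number (0 = never written)
        self.records = []    # stand-in for the on-disk provenance store

    def execve(self, pid, argv, env):
        # Attach command line and environment to the current process.
        self.processes[pid] = {"argv": list(argv), "env": dict(env), "inputs": []}

    def read(self, pid, path):
        # Attach the file (at its current version) to the process as an input.
        inputs = self.processes[pid]["inputs"]
        ref = (path, self.versions.get(path, 0))
        if ref not in inputs:  # suppress duplicate ancestors from repeated reads
            inputs.append(ref)

    def write(self, pid, path):
        # Remember that this process touched the file; no record is made yet.
        self.pending.setdefault(path, set()).add(pid)

    def close(self, pid, path):
        writers = self.pending.get(path, set())
        if pid not in writers:
            return  # a read-only open leaves no provenance record
        writers.discard(pid)
        version = self.versions.get(path, 0) + 1
        self.versions[path] = version
        proc = self.processes[pid]
        self.records.append(
            (path, version, tuple(proc["argv"]), tuple(proc["inputs"]))
        )


if __name__ == "__main__":
    c = Collector()
    c.execve(1, ["sort", "in.txt"], {"LANG": "C"})
    c.read(1, "in.txt")
    c.read(1, "in.txt")      # second read of a large file: no duplicate input
    c.write(1, "out.txt")
    c.close(1, "out.txt")
    print(c.records)
```

Process state that never leads to a write is simply dropped, which mirrors the point above that everything is tracked in memory but only matters once something becomes persistent.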

Future Challenges

  • several model-rethinks (storage, querying, cycle-breaking...)
  • security
  • interoperability is the big problem (for provenance in general)
 
panda/reading/pass06.txt · Last modified: 2009/12/01 16:06 by reader
 