===== Provenance-Aware Storage Systems (2006) =====

Link to paper: [[http://www.eecs.harvard.edu/~syrah/pass/pubs/usenix06.pdf]]

Summary:
--------
1.) A General Taxonomy of Provenance (in 2006...)

2.) Motivation for capturing system-level provenance

3.) PASS: the vision and the implementation

4.) Applications of system-granularity provenance

5.) Future challenges of provenance

Detail:
-------

1.) A General Taxonomy of Provenance (people want provenance for different reasons; your goals affect your implementations)

File, FS, DB approaches

  * this includes PASS (and maybe Trio)
  * these systems capture everything on some level of abstraction; derivation rules are inferred in real-time (i.e. systems simply capture everything)

Service-Oriented approaches
  * application-level; objects and actions for which provenance is collected are finite and pre-defined (a priori)

Scripting architectures
  * e.g. build systems maintain dependencies; source code control (e.g. CVS, SVN); also requires an a priori description of provenance derivation

Environment architectures
  * also application-level; tracks everything (doesn't require a priori specifications of what's ok) but can't capture provenance outside of application
  * Example: several scientific


2.) System-level Provenance

Arguments for collecting provenance at the system level, automatically:
  * file granularity, which is fairly detailed (not required; byte-level collection would also be possible)
  * "tight coupling" of data and provenance
    * provenance is usually stored in a standalone DB, which is a pain to maintain and back up
    * completeness; the system sees everything
  * collection is transparent to users
  * provenance-aware systems at higher levels can sit on top of the OS

Applications/Use Cases for System-Level Generated Provenance
  * script generation
    * people who run workflows over and over again can more easily automate them by looking at the provenance
  * detecting system changes (e.g. environment variables => tracks changes in tools, environment, libraries)
    * detecting intrusion (a commonly envisioned application of provenance; in practice it raises several difficult questions, e.g. how do you protect the provenance?)
  * retrieving compile-time flags (something we all forget from time to time; PASS automatically collects command-line arguments)
  * build debugging (easy to spot incomplete/missing dependencies if you know what's supposed to be there)
  * understanding (visualizing) system dependencies
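
The build-debugging use case can be sketched as a set difference between the inputs a build rule declares and the inputs that provenance actually recorded. This is a minimal Python sketch under stated assumptions: the `declared`/`observed` dictionaries, the function name, and the file names are hypothetical and are not PASS's record format or query interface.

```python
def missing_dependencies(declared, observed):
    """Return, per target, the inputs that were read but never declared.

    declared: target -> inputs listed in the build description (hypothetical)
    observed: target -> inputs recorded by provenance collection (hypothetical)
    """
    report = {}
    for target, inputs_read in observed.items():
        undeclared = set(inputs_read) - set(declared.get(target, ()))
        if undeclared:
            report[target] = sorted(undeclared)
    return report


declared = {"main.o": ["main.c"], "util.o": ["util.c", "util.h"]}
observed = {"main.o": ["main.c", "util.h"], "util.o": ["util.c", "util.h"]}
print(missing_dependencies(declared, observed))  # {'main.o': ['util.h']}
```

Here `main.o` was built from `util.h` without declaring it, which is the kind of incomplete dependency the notes describe as easy to spot once the observed inputs are available.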

PASS Itself
  * Vision:
    * PASS should act as the "base-level" system provenance collector; provenance-aware applications should be able to use the storage system to support provenance collection and querying
    * PASS should support application-layer provenance (i.e. let applications write to PASS's DB instead of keeping a separate one around)
    * seamless
    * provenance security (data and provenance are not equally sensitive)
    * support queries on provenance (more about this in the next paper)
  * What it is:
    * collects system-level provenance for all "objects" (in this case, files), stored as an "ancestry" (loose) graph
    * edges are processes, nodes are files
    * provenance includes several system-level variables, which are collected and stored with object provenance
    * definition: a file is either new or the output of some process
    * cannot collect "opaque" provenance, but allows annotations and externally generated provenance (e.g. from GenePattern)
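
The ancestry graph described above (files as nodes, processes as the edges that connect an output to its inputs) can be illustrated with a small in-memory structure. This is a minimal sketch and not PASS's data model: the class name `AncestryGraph`, its methods, and the example commands are all hypothetical.

```python
from collections import defaultdict


class AncestryGraph:
    """Files are nodes; each edge is labelled with the process that derived the output."""

    def __init__(self):
        # output file -> list of (input file, process label) edges
        self._parents = defaultdict(list)

    def add_derivation(self, output, process, inputs):
        """Record that `process` produced `output` from `inputs`."""
        for src in inputs:
            edge = (src, process)
            if edge not in self._parents[output]:
                self._parents[output].append(edge)

    def ancestors(self, path):
        """Return every file that `path` transitively derives from."""
        seen, stack = set(), [path]
        while stack:
            current = stack.pop()
            for parent, _process in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


g = AncestryGraph()
g.add_derivation("main.o", "gcc -c main.c", ["main.c", "util.h"])
g.add_derivation("app", "ld", ["main.o", "libc.a"])
print(sorted(g.ancestors("app")))  # ['libc.a', 'main.c', 'main.o', 'util.h']
```

A file with no recorded derivation, such as `main.c` here, corresponds to the "new" case in the definition; everything else is the output of some process.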

How:
  * 2 representations of provenance: one in memory and one on disk
    * disk: cross-references of file ancestry
    * memory: current environment of processes, variables, open files, etc.; anything that might affect the provenance of an object that becomes persistent
    * everything is tracked in memory because we never know when something is going to be relevant to file provenance. If a process never writes anything, we don't care; if environment variables have no effect on any files, we don't care. But if they do, we need to be able to remember them.
  * Collection (overview; it's actually more complicated than this). Goal: intercept system calls, translate them into provenance, and maintain the provenance graph in memory
    * look at system calls and the data collected for each. For example:
      * execve (execute a program): this starts a process, for which we collect:
        * environment
        * command line arguments
        * process name
        * process ID
        * kernel version
        * kernel modules loaded
        * reference to program (input)
        * all attached to the current PROCESS
      * open (open a file): this suggests we're going to do either a read or a write, and the file we're opening will be somehow involved in the ancestry
        * file path name
        * saved for attachment to the target FILE
      * read (read a file): as above
        * reference to file
        * attached to the current PROCESS as INPUT
      * write (write a file)
        * reference to the current process
        * saved in the target FILE'S INODE
    * suppresses duplicates of identical ancestors
      * e.g. if a file is large, there may be several reads of the same file; we don't want to create duplicate records
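
The collection steps above (per-process state held in memory, reads attached to the process as inputs, writes attaching the process to the target file, and duplicate suppression) can be sketched in user space as follows. This is an illustration only: PASS does this inside the kernel, and the `Collector` class, its handler names, and the record layout are hypothetical. The `file_records` dictionary merely stands in for whatever becomes persistent.

```python
class Collector:
    """Toy model of system-call-driven provenance collection."""

    def __init__(self):
        self.processes = {}     # pid -> in-memory process record
        self.file_records = {}  # path -> ancestry records (stand-in for persistent storage)

    def on_execve(self, pid, program, argv, env):
        # The program itself is recorded as the first input of the new process.
        self.processes[pid] = {
            "name": program,
            "argv": list(argv),
            "env": dict(env),
            "inputs": [program],
        }

    def on_read(self, pid, path):
        inputs = self.processes[pid]["inputs"]
        if path not in inputs:  # suppress duplicate ancestors from repeated reads
            inputs.append(path)

    def on_write(self, pid, path):
        proc = self.processes[pid]
        record = (pid, proc["name"], tuple(proc["argv"]), tuple(proc["inputs"]))
        records = self.file_records.setdefault(path, [])
        if record not in records:  # repeated writes do not add identical records
            records.append(record)


c = Collector()
c.on_execve(42, "/usr/bin/sort", ["sort", "in.txt"], {"LANG": "C"})
c.on_read(42, "in.txt")
c.on_read(42, "in.txt")   # second read of the same file adds nothing
c.on_write(42, "out.txt")
c.on_write(42, "out.txt")  # second identical write adds nothing
print(c.processes[42]["inputs"])       # ['/usr/bin/sort', 'in.txt']
print(len(c.file_records["out.txt"]))  # 1
```

A process that never calls `on_write` leaves nothing in `file_records`, which mirrors the point that in-memory state only matters once it affects an object that becomes persistent.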

    * versioning and cycle detection
      * never throw away any provenance; instead we "version" files when they change
      * achieve this by intercepting TRUNCATE operations
      * when to version? versioning on every write causes a provenance explosion --> the simple case versions on the last CLOSE and on every SYNC; the complicated case uses cycle-breaking algorithms
      * cycle breaking
        * cycles in ancestry data are nonsensical (a file cannot be its own ancestor)
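
One way to picture how versioning keeps the ancestry graph acyclic: when a new dependency would make a file its own ancestor, start a new version of the target instead of adding the edge to the current one. The sketch below is a deliberate simplification and is not the cycle-breaking algorithm used by PASS; the `VersionedGraph` class and its methods are hypothetical.

```python
class VersionedGraph:
    """Ancestry over (path, version) nodes; a new version is created to avoid cycles."""

    def __init__(self):
        self.version = {}  # path -> current version number
        self.parents = {}  # (path, version) -> set of ancestor (path, version) nodes

    def node(self, path):
        return (path, self.version.setdefault(path, 0))

    def _reaches(self, start, goal):
        """True if `goal` is `start` itself or one of its ancestors."""
        stack, seen = [start], set()
        while stack:
            cur = stack.pop()
            if cur == goal:
                return True
            if cur in seen:
                continue
            seen.add(cur)
            stack.extend(self.parents.get(cur, ()))
        return False

    def add_dependency(self, src, dst):
        """Record that dst was derived from src; return the dst node actually used."""
        src_node, dst_node = self.node(src), self.node(dst)
        if self._reaches(src_node, dst_node):
            # src already descends from dst, so the edge would close a cycle:
            # create a new version of dst that derives from the old one.
            self.version[dst] += 1
            new_node = self.node(dst)
            self.parents[new_node] = {dst_node}
            dst_node = new_node
        self.parents.setdefault(dst_node, set()).add(src_node)
        return dst_node


g = VersionedGraph()
print(g.add_dependency("a", "b"))  # ('b', 0)
print(g.add_dependency("b", "a"))  # ('a', 1)
```

In the second call, `a` version 0 is already an ancestor of `b`, so the dependency is attached to a fresh version of `a` and the graph stays acyclic.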

  * Storage
    * stored in an in-kernel DB
    * not entirely relevant because
      * a.) it depends on what you want to get out of your implementation
      * b.) it depends on how you want to query
      * c.) we changed our data model
Future Challenges
  * several model-rethinks (storage, querying, cycle-breaking...)
  * security
  * interoperability is the big problem (for provenance in general)