Provenance-Aware Storage Systems (2006)

Link to paper: http://www.eecs.harvard.edu/~syrah/pass/pubs/usenix06.pdf

Summary:


1.) A General Taxonomy of Provenance (in 2006...)

2.) Motivation for capturing system-level provenance

3.) PASS: the vision and the implementation

4.) Applications of system-granularity provenance

5.) Future challenges of provenance

Detail:


1.) A General Taxonomy of Provenance (people want provenance for different reasons; your goals affect your implementations)

File, FS, DB approaches

Service-Oriented approaches

Scripting architectures

Environment architectures

2.) System-level Provenance

Arguments for collecting provenance at the system level, automatically:

Applications/Use Cases for System-Level Generated Provenance

PASS Itself

How:

⁃	attached to the current PROCESS:
	⁃	environment
	⁃	command line arguments
	⁃	process name
	⁃	process ID
	⁃	kernel version
	⁃	kernel modules loaded
	⁃	reference to program (input)
⁃	saved for attachment to the target FILE:
	⁃	file path name
⁃	attached to the current PROCESS as INPUT:
	⁃	reference to file
⁃	saved in the target FILE'S INODE:
	⁃	reference to the current process
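
The list above can be read as a small data model: some records hang off the running process, others off the file being written. Below is a minimal sketch in Python, not the PASS kernel code. All class, field, and function names (ProcessProvenance, FileProvenance, record_read, record_write) are hypothetical and chosen only to mirror the items in the list.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProcessProvenance:
    """Records attached to the current process (hypothetical layout)."""
    pid: int
    name: str
    argv: List[str]
    env: Dict[str, str]
    kernel_version: str
    kernel_modules: List[str]
    program: str  # reference to the executed program, treated as an input
    inputs: List[str] = field(default_factory=list)  # references to files read


@dataclass
class FileProvenance:
    """Records saved for a target file (conceptually, alongside its inode)."""
    path: str
    writers: List[int] = field(default_factory=list)  # references to processes


def record_read(proc: ProcessProvenance, file: FileProvenance) -> None:
    # A read makes the file an INPUT of the current process.
    if file.path not in proc.inputs:
        proc.inputs.append(file.path)


def record_write(proc: ProcessProvenance, file: FileProvenance) -> None:
    # A write stores a reference to the current process with the target file.
    if proc.pid not in file.writers:
        file.writers.append(proc.pid)
```

For example, a process that reads one file and writes another ends up with the first file in its inputs, and the second file ends up pointing back at the process, which is enough to walk the ancestry from output to input.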

⁃ never throw away any provenance; instead we “version” files when they change

⁃	achieve this by intercepting TRUNCATE operations
⁃	when to version? versioning on every write causes a provenance explosion. the simple case versions on the last CLOSE and on every SYNC; the complicated case uses cycle-breaking algorithms.
⁃	cycle breaking
⁃	cycles in ancestry data are nonsensical
⁃	stored in an in-kernel DB
⁃	not entirely relevant because
⁃	a.) it depends on what you want to get out of your implementation
⁃	b.) it depends on how you want to query
⁃	c.) we changed our data model
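
To make the cycle problem from the versioning notes concrete: if process P reads file F and then writes F, F's ancestry includes P and P's ancestry includes F. Below is a minimal sketch in Python of one generic way to keep the graph acyclic by starting a new version when an edge would close a cycle. This is an illustrative depth-first check, not the algorithm from the paper, and the AncestryGraph class and its methods are hypothetical names.

```python
from collections import defaultdict


class AncestryGraph:
    """Ancestry over (name, version) nodes; illustrative only."""

    def __init__(self):
        self.parents = defaultdict(set)  # node -> nodes it depends on
        self.latest = {}                 # name -> latest version number

    def current(self, name):
        # Return the newest (name, version) node, creating version 0 if needed.
        return (name, self.latest.setdefault(name, 0))

    def depends_on(self, node, target):
        """True if node is target or transitively depends on it."""
        stack, seen = [node], set()
        while stack:
            n = stack.pop()
            if n == target:
                return True
            if n in seen:
                continue
            seen.add(n)
            stack.extend(self.parents.get(n, ()))
        return False

    def add_dependency(self, child_name, parent_name):
        """Record that child depends on parent, versioning child if needed."""
        child = self.current(child_name)
        parent = self.current(parent_name)
        if self.depends_on(parent, child):
            # The edge would close a cycle: start a new version of the child
            # that descends from its previous version.
            previous = child
            self.latest[child_name] += 1
            child = self.current(child_name)
            self.parents[child].add(previous)
        self.parents[child].add(parent)
        return child
```

Because a freshly created version has no dependents yet, adding edges from it can never close a cycle, which is why bumping the version is sufficient in this sketch.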

Future Challenges