Link to paper: [[http://www.eecs.harvard.edu/~syrah/pass/pubs/ipaw08.pdf]]

This is where PASSv2 starts

  * v1 was mostly about collection but not about storage and querying
  * v2 remodels storage and querying, includes provenance-aware applications

Provenance Data (at least... from the PASS perspective)
  * ancestry relationships (the core of provenance. these must be acyclic)
  * object-identity relationships (like objects may reference one another; not required to be acyclic)
  *   * most relationships can be defined by ancestry, but sometimes it's useful to capture it directly

	⁃	e.g. "siblings" -> what does this mean? it's unclear. two provenance objects could have the same parent, but different grandparents
	⁃	could also have annotations
	⁃	deciding what kinds of relationships to detect and label depends on goals an implementation
	⁃	most people care only about ancestry, but we should support direct label capturing

  * edges on graphs are processes: need to be labeled
  * so far, we have a directed (but not acyclic), labeled graph
  * experience from challenges: we find things either by attribute names, or by their position in the graph structure; hardly EVER by object name. 
  * comes down to: paths (especially ancestral paths) are first-class objects

	⁃	need to follow paths whose structure is not known beforehand (e.g. "find me all the descendants of X)
	⁃	need to support matching for regular expressions on path names (especially if we're allowing multi-layer provenance, where attribute naming conventions are not standardized)
	⁃	need to support aggregation (i.e. find me all objects with at least 10 ancestors)
	⁃	subqueries (find me the objects in graph A that mach those in graph B that are subject to these constraints)
  * semi-structured data is a natural fit

Why Other Models Don't Work

  * Relational Provenance Shortcomings

	⁃	nodes and edges; get paths from crossing edge-list with itself multiple times
	⁃	unwieldy, although possible
	⁃	no way of comparing if two paths are the same, or if two paths lead to the same object

  * XML Shortcomings (XML, XQuery, XPath etc.)

	⁃	XML is hierarchical, but follows a tree model (i.e. no diamonds); vanilla implementation is unsuitable
	⁃	XQuery path always holds the value of the object at the rightmost end; have to truncate to get intermediate results
	⁃	XQuery paths are lists of nodes rather than lists of edges
	⁃	possible to hack around until it works, but...

  * RDF and SPARQL shortcomings (child of TriQL)

	⁃	Resource Description Framework -> directed, labeled graphs
	⁃	In theory, a perfect fit
	⁃	but SPARQL has several shortcomings
	⁃	doesn't support sub-queries
	⁃	doesn't support constraints on path expressions
	⁃	doesn't support path expressions of arbitrary length
	⁃	... and more
	⁃	several extensions for SPARQL (almost anything you can imagine, if I recall correctly)
	⁃	but it's a pain to get all of them to work at once
	⁃	can workaround, and may one day be suitable; current workarounds:
	⁃	explicitly code the end-node in the query so you don't have to do a path search
	⁃	build a full path by repetition (i.e. to find great grandparents, find all parents of x, all parents of parents of x, all parents of parents of parents of x).

PQL

  * extension of Lorel
  * full discussion of PQL is out of the scope of this talk; it supports everything we need it to support
  * query format: select OUTPUTS from SOURCES where CONDITION (e.g. on layering paper marked *PQL)