Link to paper: [[http://www.eecs.harvard.edu/~syrah/pass/pubs/ipaw08.pdf]] This is where PASSv2 starts * v1 was mostly about collection but not about storage and querying * v2 remodels storage and querying, includes provenance-aware applications Provenance Data (at least... from the PASS perspective) * ancestry relationships (the core of provenance. these must be acyclic) * object-identity relationships (like objects may reference one another; not required to be acyclic) * * most relationships can be defined by ancestry, but sometimes it's useful to capture it directly ⁃ e.g. "siblings" -> what does this mean? it's unclear. two provenance objects could have the same parent, but different grandparents ⁃ could also have annotations ⁃ deciding what kinds of relationships to detect and label depends on goals an implementation ⁃ most people care only about ancestry, but we should support direct label capturing * edges on graphs are processes: need to be labeled * so far, we have a directed (but not acyclic), labeled graph * experience from challenges: we find things either by attribute names, or by their position in the graph structure; hardly EVER by object name. * comes down to: paths (especially ancestral paths) are first-class objects ⁃ need to follow paths whose structure is not known beforehand (e.g. "find me all the descendants of X) ⁃ need to support matching for regular expressions on path names (especially if we're allowing multi-layer provenance, where attribute naming conventions are not standardized) ⁃ need to support aggregation (i.e. find me all objects with at least 10 ancestors) ⁃ subqueries (find me the objects in graph A that mach those in graph B that are subject to these constraints) * semi-structured data is a natural fit Why Other Models Don't Work * Relational Provenance Shortcomings ⁃ nodes and edges; get paths from crossing edge-list with itself multiple times ⁃ unwieldy, although possible ⁃ no way of comparing if two paths are the same, or if two paths lead to the same object * XML Shortcomings (XML, XQuery, XPath etc.) ⁃ XML is hierarchical, but follows a tree model (i.e. no diamonds); vanilla implementation is unsuitable ⁃ XQuery path always holds the value of the object at the rightmost end; have to truncate to get intermediate results ⁃ XQuery paths are lists of nodes rather than lists of edges ⁃ possible to hack around until it works, but... * RDF and SPARQL shortcomings (child of TriQL) ⁃ Resource Description Framework -> directed, labeled graphs ⁃ In theory, a perfect fit ⁃ but SPARQL has several shortcomings ⁃ doesn't support sub-queries ⁃ doesn't support constraints on path expressions ⁃ doesn't support path expressions of arbitrary length ⁃ ... and more ⁃ several extensions for SPARQL (almost anything you can imagine, if I recall correctly) ⁃ but it's a pain to get all of them to work at once ⁃ can workaround, and may one day be suitable; current workarounds: ⁃ explicitly code the end-node in the query so you don't have to do a path search ⁃ build a full path by repetition (i.e. to find great grandparents, find all parents of x, all parents of parents of x, all parents of parents of parents of x). PQL * extension of Lorel * full discussion of PQL is out of the scope of this talk; it supports everything we need it to support * query format: select OUTPUTS from SOURCES where CONDITION (e.g. on layering paper marked *PQL)