Table of Contents

RELEVANT PAPERS

Approximate Lineage for Probabilistic Databases (VLDB 2008) – Chris Re, Dan Suciu

MOTIVATIONS

APPLICATIONS

  1. GO (Gene Ontology) Database:
    • Describes associations between proteins and their functions.
    • These associations are integrated from many sources (PubMed, SWISS-PROT, automatic inferrence etc).
    • Mostly low quality data (96% of ‘atoms’ are automatically inferred)
    • Need to infer most likely explanations for the reliability score of a result.
  2. Computing Similarity Scores http://www.iLike.com
    • Similarity score in iLike is a function of frequency of songs listened to, likeness for a particular artist etc.
    • Track influential facts why two people love Jazz or listen to the new MJ album.
    • Return top-K influential facts about iLike’s decision as to why two users A and B have similar tastes.

DEFINITIONS

ASSUMPTIONS

CONTRIBUTIONS

  1. Introduce two forms of approximate lineage:
    • Sufficient Lineage (less compression, less affected by skew): Can compute sufficient lineage using a randomized algorithm with the following properties:
      • lambdas is an e-approximation to lambdat (complete lineage)
      • # of monomials in lambdas < k! . c-k.(k-1)/2 . logk (1/e) , k = # of referenced base atoms, c = constant of bounded probability distribtion.
    • Polynomial Lineage (more compression, more affected by skew):
      • Proposed a randomized poly-time algorithm to compute polynomial lineage that is a (s, epsilon) approximation to the original (complete) lineage.
      • Obtain influential atoms by sorting the coefficients in polynomial lineage.
  2. Introduce two forms of explanations:
    • Sufficient Explanations
    • Finding Influential atoms
  3. Query Processing with approximate lineage
    • With a sufficient lineage that is an epsilon-approximation of complete lineage lambdat for every tuple t and query q with k subgoals, error of q is constant factor worse than epsilon.
  4. Given a k-mDNF formula, finding a sub-formula with d-monomials with largest probabilities is NP-hard even for k = 3.

DETAILS

Algorithm for Sufficient Lineage

INPUT: monotone k + 1 DNF lambdat
OUTPUT: small sufficient lambdas with e-approximation
Suff(lambdat, e)

Constructing Polynomial Lineage

Step1: Arithmetize the original lineage formula.
Step2: Approximate using sparse Fourier series.
Step3: Get a (s, e) approximation of the Fourier series (use KM algorithm etc).