===== RELEVANT PAPERS =====


[[http://www.cs.washington.edu/homes/suciu/approx_lineage.pdf | Approximate Lineage for Probabilistic Databases (VLDB 2008)]] – Chris Re, Dan Suciu

===== MOTIVATIONS =====

  * MiMI, a 250 Mb biological database has 6GB of lineage
  * Lineage of a single tuple in GO may be 9 MB
  * Answer queries regarding which is most likely explanation, which is influential etc.
  * Compress lineage, decrease QP time. 

===== APPLICATIONS =====


  - **GO** (Gene Ontology) Database: 
    * Describes associations between proteins and their functions.
    * These associations are integrated from many sources (PubMed, SWISS-PROT, automatic inferrence etc). 
    * Mostly low quality data (96% of 'atoms' are automatically inferred)
    * Need to infer most likely explanations for the reliability score of a result.
  - Computing **//Similarity Scores//** [[http://www.iLike.com]]
    * Similarity score in iLike is a function of frequency of songs listened to, likeness for a particular artist etc. 
    * Track //influential// facts why two people love Jazz or listen to the new MJ album.  
    * Return top-K influential facts about iLike's decision as to why two users A and B have similar tastes.


===== DEFINITIONS =====


  *atom: some base fact about real world e.g. data-source: "Dr X's PubMedID" 
  *e-approximation to complete lineage lambda(t): E[(lambda<sub>t'</sub> - lambda<sub>t</sub>)<sup>2</sup>] < e
 

===== ASSUMPTIONS =====


  *internal lineage functions
  *constant bounded probability distributions on atoms (p(a) > 0 => p(a) > c (const)).
  *linaege is represented as a k-m DNF.


===== CONTRIBUTIONS =====


  - Introduce two forms of approximate lineage:
    *Sufficient Lineage (less compression, less affected by skew): Can compute sufficient lineage using a randomized algorithm with the following properties:
      * lambda<sub>s</sub> is an e-approximation to lambda<sub>t</sub> (complete lineage)
      * # of monomials in lambda<sub>s</sub> < k! . c<sup>-k.(k-1)/2</sup> . log<sup>k</sup> (1/e)   , k = # of referenced base atoms, c = constant of bounded probability distribtion.  
    * Polynomial Lineage (more compression, more affected by skew):
      * Proposed a randomized poly-time algorithm to compute polynomial lineage that is a (s, epsilon) approximation to the original (complete) lineage. 
      * Obtain influential atoms by sorting the coefficients in polynomial lineage.
  - Introduce two forms of explanations:
    * Sufficient Explanations
    * Finding Influential atoms
  - Query Processing with approximate lineage
    * With a sufficient lineage that is an epsilon-approximation of complete lineage lambda<sub>t</sub> for every tuple t and query q with k subgoals, error of q is constant factor worse than epsilon.
  - Given a k-mDNF formula, finding a sub-formula with d-monomials with largest probabilities is NP-hard even for k = 3.


===== DETAILS =====


=== Algorithm for Sufficient Lineage ===

INPUT: monotone k + 1 DNF lambda<sub>t</sub>\\
OUTPUT: small sufficient lambda<sub>s</sub> with e-approximation\\
**Suff(lambda<sub>t</sub>, e)**
  *Find a matching M greedily (set of distinct monomials, no common variables)
  *If M is a good approximation, 
    *Trim M till it is still a good approximation. 
  *Else,
    *var(M): {x<sub>i</sub> | x<sub>i</sub> appears in M} is a cover.
    *arbitrarily assign monomial m to one element that covers m. (Cover)
    *let lambda<sub>i</sub> = set of monomials associated with x<sub>i</sub> (in Cover).
    *return V Suff(lambda<sub>i</sub>, e/c) where c = |Cover|, V = disjunction. 


=== Constructing Polynomial Lineage ===

Step1: Arithmetize the original lineage formula.\\
Step2: Approximate using sparse Fourier series. \\
Step3: Get a (s, e) approximation of the Fourier series (use KM algorithm etc).\\