===== RELEVANT PAPERS =====
[[http://www.cs.washington.edu/homes/suciu/approx_lineage.pdf | Approximate Lineage for Probabilistic Databases (VLDB 2008)]] – Chris Re, Dan Suciu
===== MOTIVATIONS =====
* MiMI, a 250 Mb biological database has 6GB of lineage
* Lineage of a single tuple in GO may be 9 MB
* Answer queries regarding which is most likely explanation, which is influential etc.
* Compress lineage, decrease QP time.
===== APPLICATIONS =====
- **GO** (Gene Ontology) Database:
* Describes associations between proteins and their functions.
* These associations are integrated from many sources (PubMed, SWISS-PROT, automatic inferrence etc).
* Mostly low quality data (96% of 'atoms' are automatically inferred)
* Need to infer most likely explanations for the reliability score of a result.
- Computing **//Similarity Scores//** [[http://www.iLike.com]]
* Similarity score in iLike is a function of frequency of songs listened to, likeness for a particular artist etc.
* Track //influential// facts why two people love Jazz or listen to the new MJ album.
* Return top-K influential facts about iLike's decision as to why two users A and B have similar tastes.
===== DEFINITIONS =====
*atom: some base fact about real world e.g. data-source: "Dr X's PubMedID"
*e-approximation to complete lineage lambda(t): E[(lambdat' - lambdat)2] < e
===== ASSUMPTIONS =====
*internal lineage functions
*constant bounded probability distributions on atoms (p(a) > 0 => p(a) > c (const)).
*linaege is represented as a k-m DNF.
===== CONTRIBUTIONS =====
- Introduce two forms of approximate lineage:
*Sufficient Lineage (less compression, less affected by skew): Can compute sufficient lineage using a randomized algorithm with the following properties:
* lambdas is an e-approximation to lambdat (complete lineage)
* # of monomials in lambdas < k! . c-k.(k-1)/2 . logk (1/e) , k = # of referenced base atoms, c = constant of bounded probability distribtion.
* Polynomial Lineage (more compression, more affected by skew):
* Proposed a randomized poly-time algorithm to compute polynomial lineage that is a (s, epsilon) approximation to the original (complete) lineage.
* Obtain influential atoms by sorting the coefficients in polynomial lineage.
- Introduce two forms of explanations:
* Sufficient Explanations
* Finding Influential atoms
- Query Processing with approximate lineage
* With a sufficient lineage that is an epsilon-approximation of complete lineage lambdat for every tuple t and query q with k subgoals, error of q is constant factor worse than epsilon.
- Given a k-mDNF formula, finding a sub-formula with d-monomials with largest probabilities is NP-hard even for k = 3.
===== DETAILS =====
=== Algorithm for Sufficient Lineage ===
INPUT: monotone k + 1 DNF lambdat\\
OUTPUT: small sufficient lambdas with e-approximation\\
**Suff(lambdat, e)**
*Find a matching M greedily (set of distinct monomials, no common variables)
*If M is a good approximation,
*Trim M till it is still a good approximation.
*Else,
*var(M): {xi | xi appears in M} is a cover.
*arbitrarily assign monomial m to one element that covers m. (Cover)
*let lambdai = set of monomials associated with xi (in Cover).
*return V Suff(lambdai, e/c) where c = |Cover|, V = disjunction.
=== Constructing Polynomial Lineage ===
Step1: Arithmetize the original lineage formula.\\
Step2: Approximate using sparse Fourier series. \\
Step3: Get a (s, e) approximation of the Fourier series (use KM algorithm etc).\\