===== RELEVANT PAPERS ===== [[http://www.cs.washington.edu/homes/suciu/approx_lineage.pdf | Approximate Lineage for Probabilistic Databases (VLDB 2008)]] – Chris Re, Dan Suciu ===== MOTIVATIONS ===== * MiMI, a 250 Mb biological database has 6GB of lineage * Lineage of a single tuple in GO may be 9 MB * Answer queries regarding which is most likely explanation, which is influential etc. * Compress lineage, decrease QP time. ===== APPLICATIONS ===== - **GO** (Gene Ontology) Database: * Describes associations between proteins and their functions. * These associations are integrated from many sources (PubMed, SWISS-PROT, automatic inferrence etc). * Mostly low quality data (96% of 'atoms' are automatically inferred) * Need to infer most likely explanations for the reliability score of a result. - Computing **//Similarity Scores//** [[http://www.iLike.com]] * Similarity score in iLike is a function of frequency of songs listened to, likeness for a particular artist etc. * Track //influential// facts why two people love Jazz or listen to the new MJ album. * Return top-K influential facts about iLike's decision as to why two users A and B have similar tastes. ===== DEFINITIONS ===== *atom: some base fact about real world e.g. data-source: "Dr X's PubMedID" *e-approximation to complete lineage lambda(t): E[(lambdat' - lambdat)2] < e ===== ASSUMPTIONS ===== *internal lineage functions *constant bounded probability distributions on atoms (p(a) > 0 => p(a) > c (const)). *linaege is represented as a k-m DNF. ===== CONTRIBUTIONS ===== - Introduce two forms of approximate lineage: *Sufficient Lineage (less compression, less affected by skew): Can compute sufficient lineage using a randomized algorithm with the following properties: * lambdas is an e-approximation to lambdat (complete lineage) * # of monomials in lambdas < k! . c-k.(k-1)/2 . logk (1/e) , k = # of referenced base atoms, c = constant of bounded probability distribtion. * Polynomial Lineage (more compression, more affected by skew): * Proposed a randomized poly-time algorithm to compute polynomial lineage that is a (s, epsilon) approximation to the original (complete) lineage. * Obtain influential atoms by sorting the coefficients in polynomial lineage. - Introduce two forms of explanations: * Sufficient Explanations * Finding Influential atoms - Query Processing with approximate lineage * With a sufficient lineage that is an epsilon-approximation of complete lineage lambdat for every tuple t and query q with k subgoals, error of q is constant factor worse than epsilon. - Given a k-mDNF formula, finding a sub-formula with d-monomials with largest probabilities is NP-hard even for k = 3. ===== DETAILS ===== === Algorithm for Sufficient Lineage === INPUT: monotone k + 1 DNF lambdat\\ OUTPUT: small sufficient lambdas with e-approximation\\ **Suff(lambdat, e)** *Find a matching M greedily (set of distinct monomials, no common variables) *If M is a good approximation, *Trim M till it is still a good approximation. *Else, *var(M): {xi | xi appears in M} is a cover. *arbitrarily assign monomial m to one element that covers m. (Cover) *let lambdai = set of monomials associated with xi (in Cover). *return V Suff(lambdai, e/c) where c = |Cover|, V = disjunction. === Constructing Polynomial Lineage === Step1: Arithmetize the original lineage formula.\\ Step2: Approximate using sparse Fourier series. \\ Step3: Get a (s, e) approximation of the Fourier series (use KM algorithm etc).\\