RELEVANT PAPERS
MOTIVATIONS
MiMI, a 250 Mb biological database has 6GB of lineage
Lineage of a single tuple in GO may be 9
MB
Answer queries regarding which is most likely explanation, which is influential etc.
Compress lineage, decrease QP time.
APPLICATIONS
GO (Gene Ontology) Database:
Describes associations between proteins and their functions.
These associations are integrated from many sources (PubMed, SWISS-PROT, automatic inferrence etc).
Mostly low quality data (96% of ‘atoms’ are automatically inferred)
Need to infer most likely explanations for the reliability score of a result.
-
Similarity score in iLike is a function of frequency of songs listened to, likeness for a particular artist etc.
Track influential facts why two people love Jazz or listen to the new MJ album.
Return top-K influential facts about iLike’s decision as to why two users A and B have similar tastes.
DEFINITIONS
ASSUMPTIONS
internal lineage functions
constant bounded probability distributions on atoms (p(a) > 0 ⇒ p(a) > c (const)).
linaege is represented as a k-m DNF.
CONTRIBUTIONS
Introduce two forms of approximate lineage:
Sufficient Lineage (less compression, less affected by skew): Can compute sufficient lineage using a randomized algorithm with the following properties:
lambdas is an e-approximation to lambdat (complete lineage)
# of monomials in lambdas < k! . c-k.(k-1)/2 . logk (1/e) , k = # of referenced base atoms, c = constant of bounded probability distribtion.
Polynomial Lineage (more compression, more affected by skew):
Proposed a randomized poly-time algorithm to compute polynomial lineage that is a (s, epsilon) approximation to the original (complete) lineage.
Obtain influential atoms by sorting the coefficients in polynomial lineage.
Introduce two forms of explanations:
Query Processing with approximate lineage
Given a k-mDNF formula, finding a sub-formula with d-monomials with largest probabilities is NP-hard even for k = 3.
DETAILS
Algorithm for Sufficient Lineage
INPUT: monotone k + 1 DNF lambdat
OUTPUT: small sufficient lambdas with e-approximation
Suff(lambdat, e)
Find a matching M greedily (set of distinct monomials, no common variables)
If M is a good approximation,
Else,
var(M): {xi | xi appears in M} is a cover.
arbitrarily assign monomial m to one element that covers m. (Cover)
let lambdai = set of monomials associated with xi (in Cover).
return V Suff(lambdai, e/c) where c = |Cover|, V = disjunction.
Constructing Polynomial Lineage
Step1: Arithmetize the original lineage formula.
Step2: Approximate using sparse Fourier series.
Step3: Get a (s, e) approximation of the Fourier series (use KM algorithm etc).