GO (Gene Ontology) Database:
- Describes associations between proteins and their functions.
- These associations are integrated from many sources (PubMed, SWISS-PROT, automatic inference, etc.).
- Mostly low-quality data (96% of ‘atoms’ are automatically inferred).
- Need to infer most likely explanations for the reliability score of a result.
Computing Similarity Scores http://www.iLike.com
- The similarity score in iLike is a function of how often songs are listened to, how much a user likes a particular artist, etc.
- Track influential facts why two people love Jazz or listen to the new MJ album.
- Return top-K influential facts about iLike’s decision as to why two users A and B have similar tastes.

DEFINITIONS

atom: some base fact about real world e.g. data-source: “Dr X’s PubMedID”
e-approximation to complete lineage lambda_t: a formula lambda_t’ with E[(lambda_t’ - lambda_t)²] < e
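
To make the e-approximation definition concrete, here is a minimal sketch that computes E[(lambda_t’ - lambda_t)²] by enumerating every possible world. It assumes atoms are independent, represents a DNF as a list of sets of atom names, and uses made-up atoms and probabilities; the function names are hypothetical, not from the paper.

```python
from itertools import product

def evaluate(dnf, world):
    """A DNF is true when some monomial has all of its atoms true in `world`."""
    return any(all(world[a] for a in m) for m in dnf)

def expected_sq_error(lam_full, lam_approx, prob):
    """E[(lam_approx - lam_full)^2], assuming independent atoms.

    Enumerates all 2^n worlds, so it is only usable on toy inputs.
    """
    atoms = sorted(prob)
    total = 0.0
    for bits in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, bits))
        weight = 1.0
        for a in atoms:
            weight *= prob[a] if world[a] else 1.0 - prob[a]
        diff = int(evaluate(lam_approx, world)) - int(evaluate(lam_full, world))
        total += weight * diff * diff
    return total

# Illustrative (made-up) atoms and probabilities.
prob = {"x1": 0.9, "x2": 0.8, "x3": 0.1, "x4": 0.1}
lam_t = [{"x1", "x2"}, {"x3", "x4"}]   # complete lineage
lam_s = [{"x1", "x2"}]                 # candidate approximation (a sub-formula)
err = expected_sq_error(lam_t, lam_s, prob)
# err is about 0.0028 = 0.1 * 0.1 * (1 - 0.9 * 0.8)
```

Here lam_s would count as an e-approximation of lam_t for any e above roughly 0.0028, because the two formulas disagree only when the dropped monomial fires and the kept one does not.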

ASSUMPTIONS

internal lineage functions
constant bounded probability distributions on atoms (p(a) > 0 ⇒ p(a) > c (const)).
lineage is represented as a k-mDNF (monotone k-DNF).

CONTRIBUTIONS

Introduce two forms of approximate lineage:
- Sufficient Lineage (less compression, less affected by skew): Can compute sufficient lineage using a randomized algorithm with the following properties:
  - lambda_s is an e-approximation to lambda_t (complete lineage)
  - # of monomials in lambda_s < k! · c^(-k(k-1)/2) · log^k(1/e), where k = # of referenced base atoms and c = constant of the bounded probability distribution.
- Polynomial Lineage (more compression, more affected by skew):
  - Proposed a randomized poly-time algorithm to compute polynomial lineage that is a (s, epsilon) approximation to the original (complete) lineage.
  - Obtain influential atoms by sorting the coefficients in polynomial lineage.
Introduce two forms of explanations:
- Sufficient Explanations
- Finding Influential atoms
Query Processing with approximate lineage
- With a sufficient lineage that is an epsilon-approximation of the complete lineage lambda_t for every tuple t, and a query q with k subgoals, the error of q is at most a constant factor worse than epsilon.
Given a k-mDNF formula, finding a sub-formula with d monomials that has the largest probability is NP-hard even for k = 3.

DETAILS

Algorithm for Sufficient Lineage

INPUT: monotone (k+1)-DNF lambda_t
OUTPUT: small sufficient lambda_s with e-approximation
Suff(lambda_t, e)

Find a matching M greedily (a set of distinct monomials that pairwise share no variables)
If M is a good approximation,
- Trim M for as long as it remains a good approximation.
Else,
- var(M): {x_i | x_i appears in M} is a cover.
- arbitrarily assign each monomial m to one element of var(M) that covers it. (Cover)
- let lambda_i = set of monomials associated with x_i (in Cover).
- return V Suff(lambda_i, e/c) where c = |Cover|, V = disjunction.
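
The steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the paper's exact procedure: atoms are independent; "good approximation" is taken to mean that the probability that no monomial of M fires is below e (which bounds the error of any sub-formula containing M); and, so that the recursion shrinks, the covering atom x_i is factored out of lambda_i before recursing and re-attached afterwards. All function names are hypothetical.

```python
from math import prod

def mono_prob(m, p):
    """Probability that every atom of monomial m is true (independent atoms)."""
    return prod(p[a] for a in m)

def miss_prob(monos, p):
    """P(no monomial fires), valid when the monomials share no atoms."""
    return prod(1.0 - mono_prob(m, p) for m in monos)

def suff(dnf, p, eps):
    """Return a sub-formula of `dnf` (list of monomials) meant to be an eps-approximation."""
    monos = list(dict.fromkeys(frozenset(m) for m in dnf))  # dedupe, keep order
    if not monos:
        return []
    if frozenset() in monos:            # an empty monomial makes the formula constantly true
        return [frozenset()]
    # Greedy matching: scan by decreasing probability, keep monomials with fresh atoms only.
    matching, used = [], set()
    for m in sorted(monos, key=lambda m: -mono_prob(m, p)):
        if used.isdisjoint(m):
            matching.append(m)
            used |= m
    if miss_prob(matching, p) < eps:
        # Trim: keep the shortest high-probability prefix that is still good.
        kept = []
        for m in matching:
            kept.append(m)
            if miss_prob(kept, p) < eps:
                break
        return kept
    # Otherwise var(M) covers every monomial; assign each monomial to one covering atom.
    groups = {}
    for m in monos:
        x = min(m & used)               # deterministic stand-in for "arbitrary"
        groups.setdefault(x, []).append(m - {x})
    result = []
    for x, residual in groups.items():
        for m in suff(residual, p, eps / len(groups)):
            result.append(m | {x})      # re-attach the factored-out atom
    return result

# Illustrative (made-up) input: eight atoms with probability 0.9 each.
p_hi = {f"x{i}": 0.9 for i in range(1, 9)}
dnf_hi = [{"x1", "x2"}, {"x3", "x4"}, {"x5", "x6"}, {"x7", "x8"},
          {"x1", "x3"}, {"x2", "x4"}]
small = suff(dnf_hi, p_hi, 0.05)        # two monomials already suffice here
```

When atom probabilities are low, the matching test fails and the sketch falls through to the cover branch; in the worst case it returns the whole formula.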

Constructing Polynomial Lineage

Step1: Arithmetize the original lineage formula.
Step2: Approximate using sparse Fourier series.
Step3: Get a (s, e) approximation of the Fourier series (use KM algorithm etc).
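
The three steps can be illustrated on a toy formula. The sketch below makes simplifying assumptions that are not from the notes: atoms are uniform (each true with probability 1/2), the Fourier basis is the standard parity basis chi_S(x) = (-1)^(sum of x_a for a in S), and coefficients are computed by brute-force enumeration rather than by the KM algorithm. Function names are hypothetical.

```python
from itertools import combinations, product

def arithmetize(dnf, x):
    """Step 1: OR-of-ANDs as arithmetic, 1 - prod over m of (1 - prod of x[a] for a in m)."""
    out = 1.0
    for m in dnf:
        term = 1.0
        for a in m:
            term *= x[a]
        out *= 1.0 - term
    return 1.0 - out

def fourier_coefficients(dnf, atoms):
    """Step 2: coefficient of chi_S is E[f(x) * chi_S(x)] under the uniform distribution."""
    worlds = list(product([0, 1], repeat=len(atoms)))
    coeffs = {}
    for r in range(len(atoms) + 1):
        for S in combinations(atoms, r):
            total = 0.0
            for bits in worlds:
                x = dict(zip(atoms, bits))
                sign = (-1) ** sum(x[a] for a in S)
                total += arithmetize(dnf, x) * sign
            coeffs[S] = total / len(worlds)
    return coeffs

def sparse_approx(coeffs, s):
    """Step 3: keep the s largest-magnitude terms.

    By Parseval, the expected squared error is the sum of squares of the dropped coefficients.
    """
    ranked = sorted(coeffs.items(), key=lambda kv: -abs(kv[1]))
    kept = dict(ranked[:s])
    err = sum(c * c for _, c in ranked[s:])
    return kept, err

atoms = ["x1", "x2", "x3"]
dnf = [{"x1", "x2"}, {"x3"}]            # (x1 AND x2) OR x3
coeffs = fourier_coefficients(dnf, atoms)
kept, err = sparse_approx(coeffs, 2)
# kept holds the constant term 0.625 and the x3 term -0.375; err is 0.09375
```

In this toy case the two kept terms give an (s, e) approximation with s = 2 and any e above 0.09375. Sorting coefficients by magnitude, as done in sparse_approx, is also the operation the notes mention for ranking influential atoms.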

RELEVANT PAPERS

MOTIVATIONS

APPLICATIONS