RELEVANT PAPERS

MOTIVATIONS

  • MiMI, a 250 Mb biological database has 6GB of lineage
  • Lineage of a single tuple in GO may be 9 MB
  • Answer queries regarding which is most likely explanation, which is influential etc.
  • Compress lineage, decrease QP time.

APPLICATIONS

  1. GO (Gene Ontology) Database:
    • Describes associations between proteins and their functions.
    • These associations are integrated from many sources (PubMed, SWISS-PROT, automatic inferrence etc).
    • Mostly low quality data (96% of ‘atoms’ are automatically inferred)
    • Need to infer most likely explanations for the reliability score of a result.
  2. Computing Similarity Scores http://www.iLike.com
    • Similarity score in iLike is a function of frequency of songs listened to, likeness for a particular artist etc.
    • Track influential facts why two people love Jazz or listen to the new MJ album.
    • Return top-K influential facts about iLike’s decision as to why two users A and B have similar tastes.

DEFINITIONS

  • atom: some base fact about real world e.g. data-source: “Dr X’s PubMedID”
  • e-approximation to complete lineage lambda(t): E[(lambdat’ - lambdat)2] < e

ASSUMPTIONS

  • internal lineage functions
  • constant bounded probability distributions on atoms (p(a) > 0 ⇒ p(a) > c (const)).
  • linaege is represented as a k-m DNF.

CONTRIBUTIONS

  1. Introduce two forms of approximate lineage:
    • Sufficient Lineage (less compression, less affected by skew): Can compute sufficient lineage using a randomized algorithm with the following properties:
      • lambdas is an e-approximation to lambdat (complete lineage)
      • # of monomials in lambdas < k! . c-k.(k-1)/2 . logk (1/e) , k = # of referenced base atoms, c = constant of bounded probability distribtion.
    • Polynomial Lineage (more compression, more affected by skew):
      • Proposed a randomized poly-time algorithm to compute polynomial lineage that is a (s, epsilon) approximation to the original (complete) lineage.
      • Obtain influential atoms by sorting the coefficients in polynomial lineage.
  2. Introduce two forms of explanations:
    • Sufficient Explanations
    • Finding Influential atoms
  3. Query Processing with approximate lineage
    • With a sufficient lineage that is an epsilon-approximation of complete lineage lambdat for every tuple t and query q with k subgoals, error of q is constant factor worse than epsilon.
  4. Given a k-mDNF formula, finding a sub-formula with d-monomials with largest probabilities is NP-hard even for k = 3.

DETAILS

Algorithm for Sufficient Lineage

INPUT: monotone k + 1 DNF lambdat
OUTPUT: small sufficient lambdas with e-approximation
Suff(lambdat, e)

  • Find a matching M greedily (set of distinct monomials, no common variables)
  • If M is a good approximation,
    • Trim M till it is still a good approximation.
  • Else,
    • var(M): {xi | xi appears in M} is a cover.
    • arbitrarily assign monomial m to one element that covers m. (Cover)
    • let lambdai = set of monomials associated with xi (in Cover).
    • return V Suff(lambdai, e/c) where c = |Cover|, V = disjunction.

Constructing Polynomial Lineage

Step1: Arithmetize the original lineage formula.
Step2: Approximate using sparse Fourier series.
Step3: Get a (s, e) approximation of the Fourier series (use KM algorithm etc).

 
panda/reading/approximate_lineage_for_probabilistic_databases.txt · Last modified: 2009/12/01 11:18 by ragho
 
Recent changes RSS feed Creative Commons License Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki