A. List of potential Trio applications

"Human Input" Data

  • Crime solving: The notorious Trio example.
    • However, there is a very interesting real-world crime dataset available from the Cincinnati PD (~100k records).
  • Terrorist activity detection: Depends on who is funding us. Some interest has been expressed by the political science department.
  • Biodiversity
    • BioACT (InfoLab): Tracing provenance of evolutionary genetic and morphological variations (spatial and temporal).
    • Taxonomic classification of specimens: Integration/merging of observations from different biologists, with tuple alternatives and different confidences.
    • Dolphins

Bio/Genomics Data

Notes: The latter three (TAIR, UCSC Genome, PharmGKB) are of less consideration for now. TAIR uses an object-relational data model. UCSC Genome is too huge and complex, and it is unclear whether a suitable subset can be carved out for studying Trio. PharmGKB does not seem to involve much inaccuracy or uncertainty. We list them here for future consideration.

Entity Resolution/Deduplication

  • Match&Merge with tuple alternatives, confidences, application-specific mixtures of certain (e.g., SSN) and uncertain attributes (e.g., name)
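A minimal sketch of what such a match&merge step could look like (the helper functions and record layout are made up for illustration, not Trio's or any dedup tool's actual API): the certain attribute must agree exactly, the uncertain attribute is compared fuzzily, and the merged record keeps both spellings as alternatives with confidences plus the ids of the contributing records.

<code python>
from difflib import SequenceMatcher

def match(r1, r2):
    """Return a match confidence in [0, 1]; 0 if a certain attribute differs."""
    if r1["ssn"] != r2["ssn"]:               # certain attribute: must agree exactly
        return 0.0
    return SequenceMatcher(None, r1["name"], r2["name"]).ratio()   # uncertain attribute

def merge(r1, r2, conf):
    """Merge two matching records; keep differing names as alternatives."""
    if r1["name"] == r2["name"]:
        alternatives = {r1["name"]: 1.0}
    else:
        # illustrative weighting only: split the match confidence across spellings
        alternatives = {r1["name"]: conf / 2, r2["name"]: conf / 2}
    return {"ssn": r1["ssn"],
            "name_alternatives": alternatives,
            "lineage": [r1["id"], r2["id"]]}   # remember the contributing records

a = {"id": 1, "ssn": "123-45-6789", "name": "Jon Smith"}
b = {"id": 2, "ssn": "123-45-6789", "name": "John Smith"}
score = match(a, b)
if score > 0.8:
    print(merge(a, b, score))
</code>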

Information Extraction

  • Named entity detection in full text, with some IR interest in searching semistructured data and ranking. Various tools are available.

Data Integration

  • BioBike: Bio-specific programming language and first-order inference system for biological knowledge bases (based on Lisp). Deals with a lot of data integration/diversity issues.
  • Boeing (customer and supplier data from various heterogeneous sources, but probably not publicly available; external lineage for result ranking and/or quality assurance)
  • DBLife (mixture of IE, ER, and IR applications; many black-box match&merge functions; external lineage tracing of interest)
  • Medical science and health-care

Notes: External lineage tracing for integrated databases and less structured sources (e.g., URLs) is typically of high interest, as is lineage support for black-box functions and operators.
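A rough sketch of what recording external lineage through a black-box function could look like (the wrapper, record layout, and stub matcher below are made up for illustration, not an existing Trio or DBLife interface): every output tuple is logged together with the ids of the inputs and the parameters of the call that produced it.

<code python>
lineage_log = []   # (output, input ids, function name, parameters)

def with_lineage(fn, **params):
    """Wrap an opaque function so each output is logged with its inputs and parameters."""
    def wrapped(inputs):
        outputs = fn(inputs, **params)
        for out in outputs:
            lineage_log.append({
                "output": out,
                "inputs": [i["id"] for i in inputs],
                "function": fn.__name__,
                "params": params,
            })
        return outputs
    return wrapped

def blackbox_match_merge(inputs, threshold=0.5):
    # stand-in for an opaque third-party matcher (threshold unused in this stub)
    return [{"merged_names": sorted({i["name"] for i in inputs})}]

merge_step = with_lineage(blackbox_match_merge, threshold=0.8)
merge_step([{"id": "url:a", "name": "J. Smith"},
            {"id": "url:b", "name": "John Smith"}])
print(lineage_log)
</code>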

Sensors and Streams

  • Digital sensor data (e.g., RFID): uncertainty on input mostly modeled through real-number confidences; no tuple alternatives.
  • Analog sensors: uncertainty on input modeled with Gaussians for temperature, geographic location, etc.; no tuple alternatives (see the sketch after this list).
  • Mirage: a microeconomic resource allocation system for SensorNet testbeds.
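One possible mapping from the Gaussian readings above to Trio-style tuple confidences, sketched below (illustrative only; the sensor name and the "temperature > 30" predicate are made up): a reading stored as (mean, stddev) yields a confidence equal to the Gaussian tail probability of the predicate.

<code python>
from math import erf, sqrt

def prob_greater_than(mean, stddev, threshold):
    """P(X > threshold) for X ~ Normal(mean, stddev^2)."""
    z = (threshold - mean) / (stddev * sqrt(2.0))
    return 0.5 * (1.0 - erf(z))

# hypothetical reading: sensor s17 reports temperature as a Gaussian
reading = {"sensor": "s17", "mean": 31.2, "stddev": 1.5}

conf = prob_greater_than(reading["mean"], reading["stddev"], 30.0)
print("tuple (s17, 'temperature > 30') with confidence %.2f" % conf)
</code>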

Notes: Sensor data may be incomplete; signal losses could translate to NULLs in the database as opposed to ‘maybe’ annotations.
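A tiny sketch of the two representations just mentioned (hypothetical record layout, not Trio's actual schema): a lost signal either yields a tuple whose missing value is NULL, or a 'maybe' tuple whose existence confidence is below 1.

<code python>
def as_null_tuple(tag):
    # the reading definitely happened, but the lost value becomes None/NULL
    return {"tag": tag, "location": None, "exists_conf": 1.0}

def as_maybe_tuple(tag, location, presence_prob):
    # the whole reading is uncertain: keep the best guess,
    # but mark the tuple's existence confidence as < 1
    return {"tag": tag, "location": location, "exists_conf": presence_prob}

print(as_null_tuple("T42"))
print(as_maybe_tuple("T42", "dock-3", 0.6))
</code>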

More Applications

  • “Probabilistic projections”: e.g., storing projected retail sales in an OLAP environment; could include multiple possible projections, queries over the different projections, and so forth (see the sketch after this list).
  • OCR (provides word alternatives with confidences; but high overhead for a DBMS? Queries?)
  • Provenance Challenge (mostly about tracing what we call ‘external lineage’ through black-box functions for scientific workflows; recording of input/output data, function parameters)
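For the probabilistic-projections item above, a small sketch of how multiple projections with probabilities could be stored and queried (hypothetical layout; the region, quarter, numbers, and attribute names are made up): queries can then aggregate over the alternatives, e.g., expected sales or the probability of hitting a target.

<code python>
# one projected-sales record with three alternative projections
projection = {
    "region": "EMEA",
    "quarter": "2007Q3",
    "alternatives": [       # (projected units, probability)
        (12000, 0.5),       # baseline scenario
        (15000, 0.3),       # optimistic scenario
        (9000, 0.2),        # pessimistic scenario
    ],
}

expected = sum(units * p for units, p in projection["alternatives"])
p_hit_target = sum(p for units, p in projection["alternatives"] if units >= 14000)
print("expected sales: %.0f" % expected)
print("P(sales >= 14000) = %.2f" % p_hit_target)
</code>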

B. Suitability for Trio

Each application is assessed along the following dimensions: Approximation?, Confidence?, Coverage?, Derived Data?, Other Lineage?, and typical Queries.

  • Data Deduplication
    • Approximation? yes: uncertain values
    • Confidence? yes: uncertain matches
    • Coverage? no
    • Derived Data? yes: composite records after running a dedup algorithm
    • Other Lineage? updates to base records (maybe)
    • Queries: Given a composite record R from a dedup algorithm, find the original records that contributed to R.
  • SMD
    • Approximation? yes: inaccurate readings
    • Confidence? no
    • Coverage? no
    • Derived Data? yes: raw data → normalized → analyzed
    • Other Lineage? loads, updates
    • Queries: mostly simple projections and joins
  • SGD: ruled out.
  • AMOS
    • Approximation? no
    • Confidence? yes: in the steps of finding overlaps and, later, “true” overlaps
    • Coverage? yes: also in those two steps
    • Derived Data? yes: (a) levels of derivation using various modules: finding overlaps, finding true overlaps, scaffolding, closing gaps; (b) from contigs back to raw reads, from contigs to genes, from finished sequences to both reads and contigs
    • Other Lineage? updates
    • Queries: Given a contig C generated by the assembler, find the raw reads that contributed to C.
  • CBC
    • Approximation? yes: uncertainty in what bird an observer saw
    • Confidence? yes: based on how experienced an observer is
    • Coverage? yes: observers may not have seen all the birds that passed by
    • Derived Data? yes: after pre-processing the raw data and obtaining summarized readings
    • Other Lineage? possibly: one might want to know which records are affected by a tuple, since when an approximation is resolved one may want to percolate the change to all derived records
    • Queries: bird statistics, trends in population, etc.
  • RFID Quality in HiFi
    • Approximation? no
    • Confidence? yes: based on the signal strength of a reading
    • Coverage? yes: all tags in a region may not have been read
    • Derived Data? yes: based on the cleaning-smoothing-arbitration stages of CSAVA to improve the RFID stream quality
    • Other Lineage? maybe: if one figures out that a reader was faulty, the readings due to this reader may be queried
    • Queries: existence, replacement, and movement of objects in, say, a store
  • Sensor
    • Approximation? yes: ranges/Gaussians for observed physical properties like temperature, moisture, etc.
    • Confidence? maybe: the Gaussian distributions themselves induce a confidence; not sure whether a separate confidence is also required initially, but after performing operations, confidences on tuples would crop up
    • Coverage? yes: missed readings by sensors
    • Derived Data? yes: e.g., average temperature/moisture over the last 10 minutes
    • Other Lineage? yes: if one detects a faulty sensor, all readings derived from this sensor may be desired
    • Queries: physical parameters over a time period, lots of aggregations, etc.
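The Queries entries above repeatedly ask questions of the form "given derived record R, find the base records that contributed to R". A minimal sketch of such a lineage lookup over an explicit lineage table (the record ids and layout are made up, not Trio's catalog):

<code python>
# base records loaded from the original sources
base = {
    "r1": {"name": "Jon Smith", "source": "file A"},
    "r2": {"name": "John Smith", "source": "file B"},
    "r3": {"name": "Ann Lee", "source": "file A"},
}

# derived (composite) records produced by some dedup/assembly/summarization step
derived = {"d1": {"name_alternatives": ["Jon Smith", "John Smith"]}}

# explicit lineage table: derived id -> ids of the base records it came from
lineage = {"d1": ["r1", "r2"]}

def contributing_records(derived_id):
    return [base[b] for b in lineage.get(derived_id, [])]

print(derived["d1"], "<-", contributing_records("d1"))
</code>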

Other application characteristics that may be interesting but are not in the table for now: schema complexity, data size, high-level observations (Alon: what did you mean by this?)

 