A. List of potential Trio applications

"Human Input" Data

  • Crime solving: The notorious Trio example.
    • However, there is a very interesting real-world crime dataset available from the Cincinnati PD (~100k records).
  • Terrorist activity detection: Depends on who is funding us. Some interest has been expressed by the political science department.
  • Biodiversity
    • BioACT (InfoLab): Tracing provenance of evolutionary genetic and morphological variations (spatial and temporal).
    • Taxonomic classification of specimens: Integration/merging of observations from different biologists, with tuple alternatives and different confidences.
    • Dolphins

Bio/Genomics Data

Notes: The latter three (TAIR, UCSC Genome, PharmGKB) are of less consideration for now. TAIR uses an object-relational data model. UCSC Genome is too huge and complex, and it is unclear whether a suitable subset can be carved out for studying Trio. PharmGKB does not seem to involve much inaccuracy or uncertainty. We list them here for future consideration.

Entity Resolution/Deduplication

  • Match&Merge with tuple alternatives, confidences, application-specific mixtures of certain (e.g., SSN) and uncertain attributes (e.g., name)
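A minimal sketch of what such a match&merge step could look like (the helper functions and record layout are made up for illustration, not Trio's or any dedup tool's actual API): the certain attribute must agree exactly, the uncertain attribute is compared fuzzily, and the merged record keeps both spellings as alternatives with confidences plus the ids of the contributing records.

<code python>
from difflib import SequenceMatcher

def match(r1, r2):
    """Return a match confidence in [0, 1]; 0 if a certain attribute differs."""
    if r1["ssn"] != r2["ssn"]:               # certain attribute: must agree exactly
        return 0.0
    return SequenceMatcher(None, r1["name"], r2["name"]).ratio()   # uncertain attribute

def merge(r1, r2, conf):
    """Merge two matching records; keep differing names as alternatives."""
    if r1["name"] == r2["name"]:
        alternatives = {r1["name"]: 1.0}
    else:
        # illustrative weighting only: split the match confidence across spellings
        alternatives = {r1["name"]: conf / 2, r2["name"]: conf / 2}
    return {"ssn": r1["ssn"],
            "name_alternatives": alternatives,
            "lineage": [r1["id"], r2["id"]]}   # remember the contributing records

a = {"id": 1, "ssn": "123-45-6789", "name": "Jon Smith"}
b = {"id": 2, "ssn": "123-45-6789", "name": "John Smith"}
score = match(a, b)
if score > 0.8:
    print(merge(a, b, score))
</code>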

Information Extraction

  • Named entity detection in full text, with some IR interest in searching semistructured data and ranking. Various tools are available.

Data Integration

  • BioBike: Bio-specific programming language and first-order inference system for biological knowledge bases (based on Lisp). Deals with a lot of data integration/diversity issues.
  • Boeing (customer and supplier data from various heterogeneous sources, but probably not publicly available; external lineage for result ranking and/or quality assurance)
  • DBLife (mixture of IE, ER, and IR applications; many black-box match&merge functions; external lineage tracing of interest)
  • Medical science and health-care

Notes: External lineage tracing for integrated databases and less structured sources (e.g., URLs) is typically of high interest, as is lineage support for black-box functions and operators.
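A rough sketch of what recording external lineage through a black-box function could look like (the wrapper, record layout, and stub matcher below are made up for illustration, not an existing Trio or DBLife interface): every output tuple is logged together with the ids of the inputs and the parameters of the call that produced it.

<code python>
lineage_log = []   # (output, input ids, function name, parameters)

def with_lineage(fn, **params):
    """Wrap an opaque function so each output is logged with its inputs and parameters."""
    def wrapped(inputs):
        outputs = fn(inputs, **params)
        for out in outputs:
            lineage_log.append({
                "output": out,
                "inputs": [i["id"] for i in inputs],
                "function": fn.__name__,
                "params": params,
            })
        return outputs
    return wrapped

def blackbox_match_merge(inputs, threshold=0.5):
    # stand-in for an opaque third-party matcher (threshold unused in this stub)
    return [{"merged_names": sorted({i["name"] for i in inputs})}]

merge_step = with_lineage(blackbox_match_merge, threshold=0.8)
merge_step([{"id": "url:a", "name": "J. Smith"},
            {"id": "url:b", "name": "John Smith"}])
print(lineage_log)
</code>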

Sensors and Streams

  • Digital sensor data (e.g., RFID): uncertainty on input mostly modeled through real-number confidences; no tuple alternatives.
  • Analog sensors: uncertainty on input modeled with Gaussians for temperature, geographic location, etc.; no tuple alternatives (see the sketch after this list).
  • Mirage: a microeconomic resource allocation system for SensorNet testbeds.
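One possible mapping from the Gaussian readings above to Trio-style tuple confidences, sketched below (illustrative only; the sensor name and the "temperature > 30" predicate are made up): a reading stored as (mean, stddev) yields a confidence equal to the Gaussian tail probability of the predicate.

<code python>
from math import erf, sqrt

def prob_greater_than(mean, stddev, threshold):
    """P(X > threshold) for X ~ Normal(mean, stddev^2)."""
    z = (threshold - mean) / (stddev * sqrt(2.0))
    return 0.5 * (1.0 - erf(z))

# hypothetical reading: sensor s17 reports temperature as a Gaussian
reading = {"sensor": "s17", "mean": 31.2, "stddev": 1.5}

conf = prob_greater_than(reading["mean"], reading["stddev"], 30.0)
print("tuple (s17, 'temperature > 30') with confidence %.2f" % conf)
</code>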

Notes: Sensor data may be incomplete; signal losses could translate to NULLs in the database as opposed to ‘maybe’ annotations.
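A tiny sketch of the two representations just mentioned (hypothetical record layout, not Trio's actual schema): a lost signal either yields a tuple whose missing value is NULL, or a 'maybe' tuple whose existence confidence is below 1.

<code python>
def as_null_tuple(tag):
    # the reading definitely happened, but the lost value becomes None/NULL
    return {"tag": tag, "location": None, "exists_conf": 1.0}

def as_maybe_tuple(tag, location, presence_prob):
    # the whole reading is uncertain: keep the best guess,
    # but mark the tuple's existence confidence as < 1
    return {"tag": tag, "location": location, "exists_conf": presence_prob}

print(as_null_tuple("T42"))
print(as_maybe_tuple("T42", "dock-3", 0.6))
</code>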

More Applications

  • “Probabilistic projections”: e.g., storing projected retail sales in an OLAP environment; could include multiple possible projections, queries over the different projections, and so forth (see the sketch after this list).
  • OCR (provides word alternatives with confidences; but high overhead for a DBMS? Queries?)
  • Provenance Challenge (mostly about tracing what we call ‘external lineage’ through black-box functions for scientific workflows; recording of input/output data, function parameters)
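For the probabilistic-projections item above, a small sketch of how multiple projections with probabilities could be stored and queried (hypothetical layout; the region, quarter, numbers, and attribute names are made up): queries can then aggregate over the alternatives, e.g., expected sales or the probability of hitting a target.

<code python>
# one projected-sales record with three alternative projections
projection = {
    "region": "EMEA",
    "quarter": "2007Q3",
    "alternatives": [       # (projected units, probability)
        (12000, 0.5),       # baseline scenario
        (15000, 0.3),       # optimistic scenario
        (9000, 0.2),        # pessimistic scenario
    ],
}

expected = sum(units * p for units, p in projection["alternatives"])
p_hit_target = sum(p for units, p in projection["alternatives"] if units >= 14000)
print("expected sales: %.0f" % expected)
print("P(sales >= 14000) = %.2f" % p_hit_target)
</code>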

B. Suitability for Trio

Each application is assessed along the following dimensions: Approximation?, Confidence?, Coverage?, Derived Data?, Other Lineage?, and typical Queries.

  • Data Deduplication
    • Approximation? yes: uncertain values
    • Confidence? yes: uncertain matches
    • Coverage? no
    • Derived Data? yes: composite records after running a dedup algorithm
    • Other Lineage? updates to base records (maybe)
    • Queries: Given a composite record R from a dedup algorithm, find the original records that contributed to R.
  • SMD
    • Approximation? yes: inaccurate readings
    • Confidence? no
    • Coverage? no
    • Derived Data? yes: raw data → normalized → analyzed
    • Other Lineage? loads, updates
    • Queries: mostly simple projections and joins
  • SGD: ruled out.
  • AMOS
    • Approximation? no
    • Confidence? yes: in the steps of finding overlaps and, later, “true” overlaps
    • Coverage? yes: also in those two steps
    • Derived Data? yes: (a) levels of derivation using various modules: finding overlaps, finding true overlaps, scaffolding, closing gaps; (b) from contigs back to raw reads, from contigs to genes, from finished sequences to both reads and contigs
    • Other Lineage? updates
    • Queries: Given a contig C generated by the assembler, find the raw reads that contributed to C.
  • CBC
    • Approximation? yes: uncertainty in what bird an observer saw
    • Confidence? yes: based on how experienced an observer is
    • Coverage? yes: observers may not have seen all the birds that passed by
    • Derived Data? yes: after pre-processing the raw data and obtaining summarized readings
    • Other Lineage? possibly: one might want to know which records are affected by a tuple, since when an approximation is resolved one may want to percolate the change to all derived records
    • Queries: bird statistics, trends in population, etc.
  • RFID Quality in HiFi
    • Approximation? no
    • Confidence? yes: based on the signal strength of a reading
    • Coverage? yes: all tags in a region may not have been read
    • Derived Data? yes: based on the cleaning-smoothing-arbitration stages of CSAVA to improve the RFID stream quality
    • Other Lineage? maybe: if one figures out that a reader was faulty, the readings due to this reader may be queried
    • Queries: existence, replacement, and movement of objects in, say, a store
  • Sensor
    • Approximation? yes: ranges/Gaussians for observed physical properties like temperature, moisture, etc.
    • Confidence? maybe: the Gaussian distributions themselves induce a confidence; not sure whether a separate confidence is also required initially, but after performing operations, confidences on tuples would crop up
    • Coverage? yes: missed readings by sensors
    • Derived Data? yes: e.g., average temperature/moisture over the last 10 minutes
    • Other Lineage? yes: if one detects a faulty sensor, all readings derived from this sensor may be desired
    • Queries: physical parameters over a time period, lots of aggregations, etc.
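The Queries entries above repeatedly ask questions of the form "given derived record R, find the base records that contributed to R". A minimal sketch of such a lineage lookup over an explicit lineage table (the record ids and layout are made up, not Trio's catalog):

<code python>
# base records loaded from the original sources
base = {
    "r1": {"name": "Jon Smith", "source": "file A"},
    "r2": {"name": "John Smith", "source": "file B"},
    "r3": {"name": "Ann Lee", "source": "file A"},
}

# derived (composite) records produced by some dedup/assembly/summarization step
derived = {"d1": {"name_alternatives": ["Jon Smith", "John Smith"]}}

# explicit lineage table: derived id -> ids of the base records it came from
lineage = {"d1": ["r1", "r2"]}

def contributing_records(derived_id):
    return [base[b] for b in lineage.get(derived_id, [])]

print(derived["d1"], "<-", contributing_records("d1"))
</code>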

Other application characteristics that may be interesting but are not in the table for now: schema complexity, data size, high-level observations (Alon: what did you mean by this?)

 