Notes: The last three are lower priority for now. TAIR uses an object-relational data model. UCSC Genome is too large and complex, and it is unclear whether a suitable subset can be extracted for studying Trio. PharmGKB doesn’t seem to involve much in the way of accuracy issues. We list them here for future consideration.
Notes: External lineage tracing for integrated databases and less structured sources (e.g., URLs) is typically of high interest, as is lineage support for black-box functions and operators.
Notes: Sensor data is potentially incomplete; signal losses could translate to NULLs in the database rather than ‘maybe’ annotations.
Application | Approximation? | Confidence? | Coverage? | Derived Data? | Other Lineage? | Queries |
---|---|---|---|---|---|---|
Data Deduplication | yes: uncertain values | yes: uncertain matches | no | yes: composite records after running a dedup algorithm | updates to base records (maybe) | Given composite record R from a dedup algorithm, find the original records that contributed to R |
SMD | yes: inaccurate readings | no | no | yes: raw data → normalized → analyzed | loads, updates | mostly simple projections and joins |
SGD | ruled out... | | | | | |
AMOS | no | yes: in the steps of finding overlaps and later “true” overlaps | yes: also in those two steps | yes: a. levels of derivation using various modules: finding overlaps, finding true overlaps, scaffolding, closing gaps; b. from contigs back to raw reads, from contigs to genes, from finished seqs to both reads and contigs | updates | Given a contig C generated by the assembler, find the raw reads that contributed to C |
CBC | yes: uncertainty in what bird an observer saw | yes: based on how experienced an observer is | yes: observers may not have seen all the birds that passed by | yes: after pre-processing raw data and obtaining summarized readings | yes, possibly: one might want to know which records are affected by a given tuple, since when an approximation is resolved the change may need to be percolated to all derived records | bird statistics, trends in population, etc. |
RFID Quality in Hifi | no | yes: based on signal strength of reading | yes: all tags in a region may not have been read | yes: based on the cleaning, smoothing, and arbitration stages of CSAVA to improve the RFID stream quality | maybe: if one figures out that a reader was faulty, then the readings due to this reader may be queried | existence, replacement, and movement of objects in, say, a store |
Sensor | yes: ranges or Gaussians for observed physical properties like temperature, moisture, etc. | maybe: the Gaussian distributions themselves induce a confidence, so it is not clear that a separate confidence is also required initially; but after performing operations, confidence on tuples would crop up | yes: missed readings by sensors | yes: average temperature or moisture over the last 10 minutes, for example | yes: if one detects a faulty sensor, all readings derived from this sensor may be desired | querying physical parameters over a time period, lots of aggregations, etc. |
Other application characteristics that may be interesting but are not in the table for now: schema complexity, data size, high-level observations (Alon: what did you mean by this?)
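
Several of the lineage queries in the table have the same shape: trace a derived record back to the base records that contributed to it (dedup, AMOS), or trace a base record forward to everything derived from it (CBC, RFID, Sensor). A minimal sketch of both directions follows, assuming lineage is stored as a mapping from each derived record to its immediate parents. The record names, the `lineage` dictionary, and both function names are hypothetical and chosen only for illustration; this is not how Trio represents or queries lineage.

```python
from collections import defaultdict

# Hypothetical lineage store: derived record id -> ids it was directly derived from.
# Records that never appear as a key are treated as base (raw) records.
lineage = {
    "contig_C": ["overlap_1", "overlap_2"],
    "overlap_1": ["read_a", "read_b"],
    "overlap_2": ["read_b", "read_c"],
}

def base_records(record_id, lineage):
    """Backward tracing: base records that transitively contributed to record_id."""
    parents = lineage.get(record_id)
    if not parents:
        return {record_id}
    result = set()
    for parent in parents:
        result |= base_records(parent, lineage)
    return result

def derived_records(base_id, lineage):
    """Forward tracing: every record that transitively depends on base_id."""
    # Invert the parent mapping so we can walk from parents to children.
    children = defaultdict(set)
    for child, parents in lineage.items():
        for parent in parents:
            children[parent].add(child)
    seen = set()
    stack = [base_id]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen
```

Under these assumptions, `base_records("contig_C", lineage)` corresponds to the AMOS query "find the raw reads that contributed to C", and `derived_records("read_a", lineage)` corresponds to the faulty-reader or faulty-sensor case, where every record derived from a suspect base record needs to be found.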