A Framework for Fine-grained Data Integration and Curation, with Provenance, in a Dataspace

This paper focuses on tracking provenance during manual data integration and curation.

To start, the user copies and pastes data from multiple sources into a single large table whose rows are entities and whose columns are attributes. Because the data comes from multiple sources, several rows may refer to the same entity, and several columns may refer to the same attribute. The user can therefore combine two columns (attribute resolution) or two rows (entity resolution). For either operation, the resulting row or column may contain conflicting values. In these situations, the user must designate one of the two values as the “correct value.”
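As a rough illustration of entity resolution with conflicts, the sketch below merges two rows and defers to a user-supplied callback whenever the two rows disagree. The names `merge_rows` and `choose`, and the sample attributes, are hypothetical and not taken from the paper.

```python
from typing import Callable, Dict, Optional

Row = Dict[str, Optional[str]]


def merge_rows(a: Row, b: Row, choose: Callable[[str, str, str], str]) -> Row:
    """Combine two rows believed to describe the same entity.

    `choose(attr, value_a, value_b)` stands in for the user picking the
    "correct value" when the two rows conflict (an assumed interface).
    """
    merged: Row = {}
    for attr in sorted(set(a) | set(b)):
        va, vb = a.get(attr), b.get(attr)
        if va is None or vb is None or va == vb:
            # No conflict: take whichever value is present.
            merged[attr] = va if va is not None else vb
        else:
            chosen = choose(attr, va, vb)
            if chosen not in (va, vb):
                raise ValueError(f"choice for {attr!r} must be one of the two values")
            merged[attr] = chosen
    return merged


# Hypothetical incident rows pasted from two different sources.
row_a = {"location": "Portland", "date": "2009-03-01", "injuries": None}
row_b = {"location": "Portland, OR", "date": "2009-03-01", "injuries": "2"}

# Here the "user" always prefers the second source's value.
merged = merge_rows(row_a, row_b, lambda attr, x, y: y)
```

Attribute resolution could be sketched the same way by treating columns, rather than rows, as the units being merged.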

Given any entry in the resolved table, we can ask for its provenance: the tree of steps and parent values that led to its current value. This provenance tree is built on the fly when requested. Each operation updates both the data table and a history table that records the actions taken. No values are ever deleted from the data table; instead, isVisible fields record whether a row or column should be shown in the view table. For example, entity resolution sets the visibility of the two input rows to false, then creates a new row with visibility true.
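The bookkeeping described above might look roughly like the following minimal sketch. It covers rows only (column visibility is omitted), and the class and field names (`Dataspace`, `StoredRow`, `HistoryEntry`, `is_visible`) are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StoredRow:
    values: Dict[str, str]
    is_visible: bool = True  # hidden rows stay in the data table


@dataclass
class HistoryEntry:
    operation: str
    inputs: List[int]  # ids of the rows that were combined
    output: int        # id of the row that was created


class Dataspace:
    def __init__(self) -> None:
        self.rows: List[StoredRow] = []          # data table (append-only)
        self.history: List[HistoryEntry] = []    # history table

    def add_row(self, values: Dict[str, str]) -> int:
        self.rows.append(StoredRow(dict(values)))
        return len(self.rows) - 1

    def resolve_entities(self, i: int, j: int, chosen: Dict[str, str]) -> int:
        """Hide rows i and j and add a merged row; `chosen` settles conflicts."""
        merged = {**self.rows[i].values, **self.rows[j].values, **chosen}
        self.rows[i].is_visible = False
        self.rows[j].is_visible = False
        new_id = self.add_row(merged)
        self.history.append(HistoryEntry("entity_resolution", [i, j], new_id))
        return new_id

    def view(self) -> List[Dict[str, str]]:
        """The view table: only rows whose visibility flag is true."""
        return [r.values for r in self.rows if r.is_visible]

    def provenance(self, row_id: int, attr: str) -> dict:
        """Build the provenance tree for one entry on demand from the history."""
        node = {"row": row_id, "value": self.rows[row_id].values.get(attr), "parents": []}
        for entry in self.history:
            if entry.output == row_id:
                node["operation"] = entry.operation
                node["parents"] = [self.provenance(p, attr) for p in entry.inputs]
        return node


ds = Dataspace()
a = ds.add_row({"location": "Portland", "injuries": "2"})
b = ds.add_row({"location": "Portland, OR", "injuries": "2"})
m = ds.resolve_entities(a, b, {"location": "Portland, OR"})
tree = ds.provenance(m, "location")
```

Because nothing is deleted, the tree can always be rebuilt from the history table, so no separate provenance structure needs to be stored.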

Examples of questions supported by the recorded data: Which incidents were combined from others in the table? Which incidents in the table had data that was inconsistent and then corrected? How many information sources support the value of a given field?
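To make these questions concrete, here is a small sketch of how they could be answered from a history table. The record layout (`inputs`, `output`, `conflicts`), the sample data, and the function names are all hypothetical; the paper's actual schema may differ.

```python
from typing import Dict, List, Set

# Hypothetical history: rows 0 and 1 merged into 3, then 3 and 2 into 4.
history = [
    {"op": "entity_resolution", "inputs": [0, 1], "output": 3, "conflicts": []},
    {"op": "entity_resolution", "inputs": [3, 2], "output": 4, "conflicts": ["date"]},
]
# Original (pasted) rows, with the source each one came from.
sources = {0: "report_a", 1: "report_b", 2: "report_c"}
values = {
    0: {"date": "2009-03-01"},
    1: {"date": "2009-03-01"},
    2: {"date": "2009-03-02"},
}


def combined_rows(history: List[dict]) -> Set[int]:
    """Rows that were produced by combining other rows."""
    return {entry["output"] for entry in history}


def corrected_rows(history: List[dict]) -> Set[int]:
    """Rows whose creation required resolving at least one conflict."""
    return {entry["output"] for entry in history if entry["conflicts"]}


def original_rows(row_id: int, history: List[dict]) -> List[int]:
    """Walk the history back to the pasted rows a given row derives from."""
    for entry in history:
        if entry["output"] == row_id:
            return [leaf for p in entry["inputs"] for leaf in original_rows(p, history)]
    return [row_id]


def supporting_sources(row_id: int, attr: str, value: str,
                       history: List[dict],
                       values: Dict[int, Dict[str, str]],
                       sources: Dict[int, str]) -> int:
    """Count distinct sources whose original row carried `value` for `attr`."""
    leaves = original_rows(row_id, history)
    return len({sources[r] for r in leaves if values[r].get(attr) == value})


support = supporting_sources(4, "date", "2009-03-01", history, values, sources)
```

In this sample, two of the three original reports agree on the date kept in row 4, so `support` counts two sources.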