One-sentence overview

The paper proposes a framework for managing annotations on tuples of a relational database, and describes an implementation of the framework over a relational DBMS.

Motivation

The main motivation is the management of scientific data, where data annotating and copying are common operations. More concretely, scientists often place annotations on scientific data to record metadata of interest, e.g., the reliability of a specific piece of information, the accuracy of values that correspond to experimental results, or information on who has accessed or edited a specific datum. Moreover, it is common in the scientific domain to copy and transform data between databases, in which case it is useful to carry the respective annotations from the source to the target database.

Based on this motivation, the paper deals with the following high-level problem: Given a relational database with annotations and a query over the database, compute the annotations that appear in the output of the query.

Annotation Model

Annotations can be placed on the attributes of specific tuples. Formally, let (X,t,i) denote a location in the database instance, where X is the table name, t is a specific tuple in X, and i is the index of an attribute of the tuple. Each location (X,t,i) is associated with a (potentially empty) set of annotations. The paper does not place any semantics on these annotations. The only assumption is that each annotation has a unique id.

The following is an example database in the previous model. The database contains two tables named X and Y. The annotations for each location are shown as a set next to the value.

X A B
1 {t} 10

Y A C
1 {r,s} 20 {u,v}
2 20 {w}

Query Model

The paper essentially considers the class of SPJU queries over an annotated relational database. To handle annotations, the query model is extended with a clause that controls how annotations are carried from the input data to the output data.

Specifically, the authors extend SQL with a PROPAGATE clause. The following is an example query Q over the previous database.

    SELECT DISTINCT x.A AS QA, y.C AS QC
    FROM X x, Y y
    WHERE x.A = y.A
    PROPAGATE x.A TO QA, y.C TO QC

In this example, the PROPAGATE clause specifies that the annotations of attribute QA in the result are copied from attribute x.A of the corresponding witness, and the same holds for QC and y.C. The result of this query is as follows:

QA QB
1 {t} 20 {u,v}

This type of propagation is termed custom, since the query specifies exactly how the annotations are copied through the query. Another choice for the propagation is the default scheme, where the idea is that annotations are copied from the same place as the output data. The following is an example of using the default scheme:

    SELECT DISTINCT x.A AS QA, y.C AS QC
    FROM X x, Y y
    WHERE x.A = y.A
    PROPAGATE DEFAULT

This is the same query as before, except that the propagate clause is different. Clearly, the output attribute QA is copied from attribute x.A and QC is copied from y.C. The default scheme specifies that the annotations are propagated accordingly. It is easy to verify that the propagation under the default scheme is exactly the same as the previous example that used custom propagation. However, consider the following slightly modified query:

    SELECT DISTINCT x.A AS QA, 20 AS QC
    FROM X x,Y y
    WHERE x.A = y.A
    PROPAGATE DEFAULT

In this case, the value of attribute QC is copied from a fresh value 20, which does not contain any annotations. Hence, under the default scheme, QC does not carry any annotations.

One shortcoming of the default scheme is that the propagation depends on the syntax of the query and not its semantics. As an example, consider the following two queries:

    SELECT DISTINCT x.A AS QA, y.C AS QC
    FROM X x, Y y
    WHERE x.A = y.A
    PROPAGATE DEFAULT   
    SELECT DISTINCT y.A AS QA, y.C AS QC
    FROM X x, Y y
    WHERE x.A = y.A
    PROPAGATE DEFAULT

Syntactically, the two queries differ only in the origin of result attribute QA. They are equivalent semantically, as they generate the same result for any instance of the database. However, the output annotations for QA are different because they are copied from different locations in the two queries. This observation motivates the last propagation scheme which is termed DEFAULT-ALL. Under this scheme, the annotations for some query Q are propagated according to a set of queries that are all equivalent to Q. This query set is termed the query basis of Q.

The algorithm to generate the query basis is best described with an example. Consider the following query:

    SELECT DISTINCT x.A AS QA, y.C AS QC
    FROM X x, Y y
    WHERE x.A = y.A
    PROPAGATE DEFAULT-ALL   

The first step is to generate a seed query Q0 that takes into account the transitive equalities present in the query. For our example, the seed query is as follows:

    SELECT DISTINCT x.A AS QA, y.C AS QC
    FROM X x, Y y
    WHERE x.A = y.A
    PROPAGATE x.A to QA, y.A TO QA, y.C TO QC   

Note that the PROPAGATE clause uses the custom scheme to simulate the original DEFAULT scheme, and it also takes into account the transitive equality QA = x.A = y.A. The next step is to generate a set of queries from Q0, where each query performs a self-join to a single relation so that the annotations of equal attribute values are propagated to the output. Here is one possible query in this set:

    SELECT DISTINCT x.A AS QA, y.C AS QC
    FROM X x, Y y, X x'
    WHERE x.A = y.A AND x'.A = x.A
    PROPAGATE x.A to QA, y.A TO QA, y.C TO QC, x'.A to QA

Here is another one:

    SELECT DISTINCT x.A AS QA, y.C AS QC
    FROM X x, Y y, Y y'
    WHERE x.A = y.A AND y'.A = y.A
    PROPAGATE x.A to QA, y.A TO QA, y.C TO QC, y'.A to QA

And here is the last one:

    SELECT DISTINCT x.A AS QA, y.C AS QC
    FROM X x, Y y, Y y'
    WHERE x.A = y.A AND y'.C = y.C
    PROPAGATE x.A to QA, y.A TO QA, y.C TO QC, y'.C to QC

This query is interesting because QC acquires the annotations of a Y.C value even if the corresponding Y.A value does not join with a tuple from X. The annotations for the original query are computed based on the union of the queries in the query set.

The paper describes an algorithm for computing the query basis for any input query Q. A key result is that the number of queries in the basis is polynomial to the size of Q (#relations+#attributes), and the size of each query in the basis is also polynomial in the size of Q.

System Implementation

The paper describes an implementation of the system over a conventional RDBMS. The schema of each relation is augmented with annotation columns, one per original attribute. The domain of each annotation column is the set of annotation ids, which means that the same tuple may repeat several times in order to record the appearance of several annotations.

The query processing pipeline consists of three stages. First, the query is translated to a union of SQL queries, according to the propagation scheme (custom, DEFAULT, DEFAULT-ALL). Second, the union query is evaluated by the RDBMS. Third, the results are post-processed to merge the annotations of duplicate tuples.

Paper Reference

D.Bhagwat, L. Chiticariu, W.C. Tan, G. Vijayvargiya, An annotation management system for relational databases, In VLDB Journal, 14(4), 2005.

 
panda/reading/dbnotes.txt · Last modified: 2009/12/01 11:12 by ragho
 
Recent changes RSS feed Creative Commons License Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki