Relevant Paper
Motivation

The motivation for this paper is very familiar: We know how to trace the lineage for relational operators, but not for black box arbitrary transformations.

Idea

This paper is a very systems oriented paper. There isn’t any formal foundations or theory in it. However the system they have designed could be very useful. The idea is this: Given a black box arbitrary transformation, whose binary is available, they first take the binary and instrument it by some extra instructions to trace lineage. Then while running this instrumented binary, they track for each output of the binary, which input it depends on. The dependency of an output data is based on a binary debugging technique called “Dynamic Slicing”.

Dynamic Slicing is a debugging technique which computes for each instruction in the binary the previously executed instructions that it depends on. Consider the following code:

1. y = input[0]

2. x = input[1];

3. if (x > 5) {

4. output[0] = y

5. }

Let’s consider line 4. In this example the “Dynamic Slice” of line 4 is the union of three things: itself (line 4), the dynamic slice for line 3 (because 4 “conditional depends” on 3) and line 1 (because the variable y is used on the rhs of line 4). Computing the transitive closure of these dynamic slices, we get that line 4 depends on line 1, 2 and 3, 4. Dynamic slice is defined for each line of code (each instruction to be more specific). What they call “data lineage” is adjusting this Dynamic Slicing idea to output-input dependencies. Data lineage is defined for variables and in our example we would say: output[0]’s data lineage is input[0] and input[1].

They have built a very low-level system (an extension of valgrind) which is very slow (10-30x slower than uninstrumented binary) but captures these dependencies for black box transformations whose binaries are available. They record which memory and register locations has been accessed and they have a predefined way of identifying which memory locations correspond to inputs and which correspond to outputs.

They have tested the system on various applications: applications used by biologists (which is also their running example), data mining, machine learning, etc. and they claim that they have been very successful in detecting wrong input by tracing backwards from suspicious output. They have also been successful in finding bugs in the actual transformations by being able to inspect the dependencies between input and output elements.

 
panda/reading/tracingbeyondrelational.txt · Last modified: 2010/03/12 01:44 by reader
 
Recent changes RSS feed Creative Commons License Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki