jeudi 21 mai 2015

SQL find minimal dataset which satisfies all constraints

My question arises from a database test. We are forced to reduce the amount of data in order to arrive at a practically treatable selection, however, this selected data should go through as many branches of the software as possible.

As an example, consider the following case where we have 5 transformations in the code which give a different output based on whether some criterion is satisfied. The data might look like this:

  real_data    crit1   crit2   crit3   crit4   crit5
-----------------------------------------------------
  <......>         0       1       0       1       0
                   1       1       0       0       1
                   1       0       0       1       1
                   0       1       1       0       0
                   1       0       1       1       1

There are many entries in front holding the "real" data --the specific form of which is not relevant here. Further, appended to the table, there are flags whether the criterion in the different transformation steps is satisfied.

The workflow is then like this:

             ________         ________                     ________ 
 data ___\  | step 1 | ____\ | step 2 | ____\ ...  ____\  | step 5 | ____\ result
         /  |  0/1?  |     / |  0/1?  |     /          /  |  0/1?  |     /
            |________|       |________|                   |________|  

In theory, the transformations in the boxes might be correlated, so in the worst case one would have to test 2^5 cases. But clearly, this is not feasible in practice.

Therefore, one often assumes that those transformations are uncorrelated -- and thus one is satisfied as long as each branch is taken one time. In the best case one could use exactly two samples for that, namely,

  real_data    crit1   crit2   crit3   crit4   crit5
-----------------------------------------------------
  <......>         0       0       0       0       0
                   1       1       1       1       1

But such a constellation doesn't exist in the given data set shown before.

So the question is: how do I find a minimal subset of the data such that each criterion-field contains 0 and 1?

(Though I believe this is more an algorithmic question, I've also tagged SQL as the solution shall be realized in DB2 LUW. Moreover, just to note, I have the feeling that dynamic programming is appropriate here.)

Aucun commentaire:

Enregistrer un commentaire