Monday, June 24, 2019

How to find binning for extremely skewed distributions

I need to create some test data to put into my test database. I cannot use live data, and when I want to test my logic I need "plausible" data in the test database. Let's say I am working with just one table T (I will need to do the same thing for many tables) that has several columns, some numerical and some categorical. For categorical "dimensions", creating "plausible" rows is easy: if, for instance, I have 3 categories and I see that 50% of the real rows have category A, 30% have category B and 20% have category C, then for each test row I can generate a random number between 0 and 1: if it is less than 0.5 I select A, if it is higher than 0.8 I select C, and B otherwise.
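For what it's worth, this is roughly what I do today, as a minimal sketch in Python (picked only for illustration; the frequencies are the example ones above):

    import random

    # Observed category frequencies from the real table (example values above).
    frequencies = {"A": 0.5, "B": 0.3, "C": 0.2}

    def sample_category(frequencies):
        """Pick a category with probability proportional to its observed frequency."""
        r = random.random()  # uniform in [0, 1)
        cumulative = 0.0
        for category, freq in frequencies.items():
            cumulative += freq
            if r < cumulative:
                return category
        return category  # guard against rounding error in the last category

    # Example: generate ten plausible categorical test values.
    print([sample_category(frequencies) for _ in range(10)])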

I would like to have a similar method for continuous dimensions, and to that end I thought about binning. The problem is that I don't know how many bins to use, or whether it would be better to use bins of different sizes. Ideally, I would like a bin to encompass all the consecutive values that appear with a similar frequency. Unfortunately, I have some extremely skewed distributions. For example, column C has the value 0 in 1.4 million rows, and in the remaining 0.1 million rows it takes 80K different values, with frequencies ranging from 1 to 250.

I need an algorithm that can handle such extreme situations without human intervention. A possible distribution here would be: for each row, draw a number from 1 to 15; if it is less than 15, the value of the test column is 0, otherwise it is a random value between 0 (exclusive) and the maximum value of the column. I am not sure this is a good representation of the table and, most of all, I need to find such a distribution automatically for whatever real distributions of values I encounter.
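As a rough sketch of that idea (again in Python; max_value here is a hypothetical placeholder for the real column maximum, which I would read from the live table):

    import random

    max_value = 1000.0       # hypothetical placeholder for the maximum observed in the real column
    zero_fraction = 14 / 15  # value 0 appears in roughly 1.4M of the 1.5M rows

    def sample_skewed():
        """Return 0 with probability 14/15, otherwise a uniform random value up to max_value."""
        if random.random() < zero_fraction:
            return 0
        return random.uniform(0, max_value)  # crude: ignores the real shape of the nonzero values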

I already tried the Freedman–Diaconis rule, but it gives me bins of width 0 because the IQR is 0. Are there other algorithms I could use?
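For reference, the zero width is easy to reproduce (a minimal check with NumPy on synthetic data that just mimics the skew described above):

    import numpy as np

    # Mimic the skew: 1.4 million zeros plus 0.1 million nonzero values.
    data = np.concatenate([np.zeros(1_400_000),
                           np.random.randint(1, 80_001, size=100_000)])

    q75, q25 = np.percentile(data, [75, 25])
    iqr = q75 - q25                         # 0, because more than 75% of the values are 0
    width = 2 * iqr / len(data) ** (1 / 3)  # Freedman–Diaconis bin width
    print(iqr, width)                       # both 0, so the rule degenerates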

Thanks a lot
