Sunday, February 14, 2021

Good way to test with `numpy.allclose` on a time series?

I'm trying to test in Python whether a vector of recovered times is close to a vector of ground truth times. Let's ignore how we recover the times; it's not relevant to the question.

My first instinct was to use numpy.allclose, but unless I'm misunderstanding something, allclose is actually a bad fit here because of how it works.

Essentially you specify an absolute tolerance atol and relative tolerance rtol, along with your ground truth vector b and a comparison vector a, and numpy.allclose returns:

all(numpy.abs(a - b) <= atol + rtol * numpy.abs(b))

There's some nuance to what the actual function does as you can see in the source but the "pseudo-numpython" above from the docs gives you the basic idea.
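To make the formula concrete, here is a minimal sketch that re-implements it and compares the result with `numpy.allclose` on a small made-up example. The helper name `allclose_by_formula` and the sample arrays are assumptions for illustration only, and the sketch skips the NaN and infinity handling that the real function does.

```python
import numpy as np


def allclose_by_formula(a, b, rtol=1e-05, atol=1e-08):
    """Hypothetical helper: the documented formula, without NaN/inf handling."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # The tolerance is computed per element and scales with |b|, the reference.
    return bool(np.all(np.abs(a - b) <= atol + rtol * np.abs(b)))


# Made-up values: each element is off by about 0.05 % of its reference.
a = np.array([1.0, 100.0, 10000.0])
b = np.array([1.0005, 100.05, 10005.0])

formula_result = allclose_by_formula(a, b, rtol=1e-3, atol=0.0)
numpy_result = bool(np.allclose(a, b, rtol=1e-3, atol=0.0))
print(formula_result, numpy_result)
```

Note that the absolute error grows from 0.0005 to 5.0 across the vector, yet every element passes, because the allowed error grows with `b` as well.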

The issue is that with any monotonically-increasing vector of positive values, like a time series, your tolerance actually will increase!

Take this series of times in seconds:

>>> import numpy as np
>>> times_true = np.array([0.01147392, 0.46244898, 0.78571429, 1.22238095, 1.74857143,
...                        2.30984127, 2.92777778, 3.57      , 4.16634921, 4.76809524])
>>> times_recovered = np.array([0.00944365, 0.46007857, 0.7838881 , 1.22103095, 1.74722143,
...                             2.30849127, 2.92642778, 3.56865   , 4.16499921, 4.76674524])

I want my times to be no more than a millisecond apart, plus or minus some wiggle room. This is basically the case for my example vectors.

>>> np.abs(times_recovered - times_true)
array([0.00203027, 0.00237041, 0.00182619, 0.00135   , 0.00135   ,
       0.00135   , 0.00135   , 0.00135   , 0.00135   , 0.00135   ])

Since I want the values to be "roughly 1 msec apart", I set both atol and rtol to 0.001. My understanding of these terms right now is that atol bounds the absolute difference between each element of a and b, i.e., np.abs(a - b), and that rtol is some additional "slop" tolerance we can add. (edit: changed how I defined the terms originally).

Now look at what this gives me for the second term above:

>>> atol, rtol = 0.001, 0.001
>>> rtol * np.abs(times_true)
array([1.14739229e-05, 4.62448980e-04, 7.85714286e-04, 1.22238095e-03,
   1.74857143e-03, 2.30984127e-03, 2.92777778e-03, 3.57000000e-03,
   4.16634921e-03, 4.76809524e-03])

For this vector of times, the relative term starts at about 1e-5 and ends at about 5e-3, a difference of more than two orders of magnitude. In other words, allclose will check whether the differences np.abs(a - b) are less than or equal to the following:

>>> atol + rtol * np.abs(times_true)
array([0.00101147, 0.00146245, 0.00178571, 0.00222238, 0.00274857,
       0.00330984, 0.00392778, 0.00457   , 0.00516635, 0.0057681 ])

This seems bad. I want my tolerance to be roughly the same at every point, but it is clearly increasing, and it will only keep increasing as the times in my vectors get larger. It's also bad because for small times the tolerance is actually smaller than I want, which gives me false alarms.
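As a rough check of this, the sketch below applies the documented formula element by element to the two vectors from above, with atol and rtol both set to 0.001. The variable names `diff`, `tol` and `passes` are just illustrative.

```python
import numpy as np

times_true = np.array([0.01147392, 0.46244898, 0.78571429, 1.22238095, 1.74857143,
                       2.30984127, 2.92777778, 3.57, 4.16634921, 4.76809524])
times_recovered = np.array([0.00944365, 0.46007857, 0.7838881, 1.22103095, 1.74722143,
                            2.30849127, 2.92642778, 3.56865, 4.16499921, 4.76674524])

atol, rtol = 0.001, 0.001

diff = np.abs(times_recovered - times_true)   # actual error per element
tol = atol + rtol * np.abs(times_true)        # allowed error per element
passes = diff <= tol

# Print the per-element comparison so the growing tolerance is visible.
for t, d, limit, ok in zip(times_true, diff, tol, passes):
    print(f"t={t:8.5f}  diff={d:.5f}  tol={limit:.5f}  {'ok' if ok else 'FAIL'}")

overall = bool(np.allclose(times_recovered, times_true, rtol=rtol, atol=atol))
print("allclose:", overall)
```

With these settings the first three elements fail because their tolerance is tighter than their error, while the later elements pass with room to spare, so `allclose` as a whole reports a mismatch.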

It seems like what I should really do is just take np.abs(times_recovered - times_true) and ask whether any of the values are greater than the largest difference I'm willing to tolerate:

>>> MAX_DIFF = 0.003
>>> assert not np.any(np.abs(times_recovered - times_true) > MAX_DIFF)

but if so then am I just completely misunderstanding how numpy.allclose is supposed to work? Any feedback from sage scientific Pythonistas would be appreciated.
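One possible way to get the same constant tolerance without leaving `allclose`, offered as a sketch and not as the recommended answer: the formula above reduces to `np.abs(a - b) <= atol` when rtol is set to 0. The example below reuses the vectors and `MAX_DIFF` from this post; `np.testing.assert_allclose` is included because it reports the mismatching elements when the check fails.

```python
import numpy as np

times_true = np.array([0.01147392, 0.46244898, 0.78571429, 1.22238095, 1.74857143,
                       2.30984127, 2.92777778, 3.57, 4.16634921, 4.76809524])
times_recovered = np.array([0.00944365, 0.46007857, 0.7838881, 1.22103095, 1.74722143,
                            2.30849127, 2.92642778, 3.56865, 4.16499921, 4.76674524])

MAX_DIFF = 0.003

# The manual check from the post.
manual_ok = not np.any(np.abs(times_recovered - times_true) > MAX_DIFF)

# Same check via allclose: rtol=0 removes the term that scales with |b|.
allclose_ok = bool(np.allclose(times_recovered, times_true, rtol=0, atol=MAX_DIFF))

# In a test suite, this raises AssertionError with details on failure.
np.testing.assert_allclose(times_recovered, times_true, rtol=0, atol=MAX_DIFF)

print(manual_ok, allclose_ok)
```

Both checks agree here because the largest difference, about 0.0024, is below `MAX_DIFF`.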
