Thursday, June 14, 2018

What is the best way to check correct dtypes in a pandas dataframe as part of testing?

Before pre-processing and training a model on some data, I want to check that each feature (each column) of a dataframe has the correct data type. For example, if a dataframe has columns col1, col2, col3, they should have types int, float, and string respectively, as I have defined them (col1 can't be of type string; which type goes with which column matters).

What is the best way to do this, given that

  1. the columns have mixed types: int, float, timestamp, string
  2. there are too many columns (>500) to write out / label each column's data type by hand

Something like

types = df.dtypes  # a pandas Series mapping column name to dtype
# Note: `types != correct_types` returns an element-wise Series, which is
# ambiguous in an `if`; Series.equals reduces the comparison to a single bool.
if not types.equals(correct_types):
    raise TypeError("Some of the columns do not have the correct type")

Where correct_types holds the known data type of each column. It would need to be indexed the same way as types to ensure each column is matched against the right type. It would also be good to know which column is failing the check (so maybe a for loop over the columns is more appropriate? see the sketch below).
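A minimal sketch of that per-column loop, assuming the expected schema lives in a dict (here called expected_types, a name I'm making up) mapping column names to dtype strings:

import pandas as pd

# Hypothetical expected schema: column name -> dtype string.
# Note: pandas stores plain Python strings under the "object" dtype by default.
expected_types = {"col1": "int64", "col2": "float64", "col3": "object"}

def check_dtypes(df, expected):
    # Collect every mismatched column so the error names all offenders at once.
    mismatches = [
        (col, str(df[col].dtype), want)
        for col, want in expected.items()
        if str(df[col].dtype) != want
    ]
    if mismatches:
        details = "; ".join(
            "%s: got %s, expected %s" % (col, got, want)
            for col, got, want in mismatches
        )
        raise TypeError("Columns with wrong dtype: " + details)

df = pd.DataFrame({"col1": [1], "col2": [2.0], "col3": ["a"]})
check_dtypes(df, expected_types)  # passes silently; a mismatch would raise

Matching by column name rather than by position also sidesteps the ordering problem: each column is compared against its own expected type no matter where it sits in the dataframe.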

Is there a way to achieve this, and if so, what is the best approach? Maybe I am looking at the issue the wrong way. More generally, how do I ensure that the columns of df have the data types I have defined for them?
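For the 500-plus-column case, one option (a sketch, assuming you already have at least one dataframe whose types are known to be correct) is to capture the schema from it once and persist it, rather than writing every entry out by hand:

import json

import pandas as pd

def save_schema(df, path):
    # df.dtypes is a Series of dtype objects; cast to str so it serialises as JSON.
    with open(path, "w") as f:
        json.dump(df.dtypes.astype(str).to_dict(), f)

def load_schema(path):
    with open(path) as f:
        return json.load(f)

# Capture once from a known-good dataframe, then reuse in later test runs.
good_df = pd.DataFrame({"col1": [1], "col2": [2.0], "col3": ["a"]})
save_schema(good_df, "schema.json")
expected_types = load_schema("schema.json")
# expected_types == {"col1": "int64", "col2": "float64", "col3": "object"}

The stored schema could then be fed straight into a per-column check like the one sketched above.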
