testing: data corrupted after train_test

Tuesday, 30 May 2017

data corrupted after train_test_split

I am using sklearn's train_test_split() function to create my training and test sets, with the following straightforward code:

from sklearn.model_selection import train_test_split
print(y[0])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(y_train[0])

Before the split, y[0] contains values like this:

[ 0.01689708  0.06298003  0.04147466  0.00460829  0.03686636  0.
  0.02457757  0.01996928  0.0015361   0.01689708  0.01536098  0.0015361
  0.02304148  0.02150538  0.01382488  0.03686636  0.0015361   0.01228879
  0.00460829  0.0015361   0.093702    0.00460829  0.00614439  0.00614439
  0.0030722   0.06605223  0.          0.00768049  0.          0.0030722
  0.0030722   0.0015361   0.0015361   0.0015361   0.0030722   0.0015361
  0.0030722   0.00614439  0.01382488  0.          0.          0.0030722
  0.0015361   0.02150538  0.01228879  0.00921659  0.01382488  0.00460829
  0.0030722   0.0030722   0.0030722   0.0030722   0.01843318  0.00768049
  0.          0.01075269  0.03072197  0.0015361   0.00460829  0.00460829
  0.00614439  0.00614439  0.0030722   0.          0.01075269  0.0030722
  0.0015361   0.02611367  0.0015361   0.          0.          0.0030722   0.
  0.          0.          0.0030722   0.0030722   0.          0.00460829
  0.0015361   0.01996928  0.00460829  0.0030722   0.0030722   0.01382488
  0.0030722   0.01536098  0.00768049  0.00460829  0.02918587  0.          0.0015361
  0.02304148  0.0030722   0.00768049  0.00768049  0.          0.0030722
  0.03225806  0.0030722 ]

The print after the split, however, yields only 0's:

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

Of course, index 0 does not refer to the same data point after splitting, because train_test_split shuffles the rows. But as far as I can tell, the y data before the split does not contain any row that consists only of 0's.
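One way to check both points would be to look for all-zero rows in y before the split, and to pass an index array through train_test_split so that every output row can be traced back to its original position. The snippet below is only a minimal sketch: X and y are random stand-in arrays (not my real data), the names zero_rows, idx, idx_train and idx_test are made up for illustration, and one all-zero row is planted on purpose so the check has something to find.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 50 samples, 10 features, 5 target values per sample.
rng = np.random.default_rng(0)
X = rng.random((50, 10))
y = rng.random((50, 5))
y[7] = 0.0  # deliberately plant one all-zero target row

# Indices of the rows of y that are entirely zero before the split.
zero_rows = np.flatnonzero(~y.any(axis=1))
print("all-zero rows before the split:", zero_rows)

# Split the row indices together with the data; train_test_split applies
# the same shuffle to every array it is given.
idx = np.arange(len(y))
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, idx, test_size=0.2, random_state=42
)

# If the split only reorders rows, each output row equals its source row.
train_matches = np.array_equal(y_train, y[idx_train])
test_matches = np.array_equal(y_test, y[idx_test])
print("y_train[0] comes from original row", idx_train[0])
print("rows unchanged by the split:", train_matches and test_matches)
```

If zero_rows comes back empty on the real data while y_train still has an all-zero first row, the problem would have to be somewhere other than the shuffle itself.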

