Wednesday, February 3, 2021

Machine Learning: cross_val_score vs cross_val_predict

While building a generic evaluation tool, I ran into the following problem: cross_val_score(...).mean() gives slightly different results than computing the same metric from the output of cross_val_predict.

To calculate the test score I have the following code, which computes the score for each fold and then takes the mean over all folds.

from sklearn.model_selection import cross_val_score

testing_score = cross_val_score(clas_model, algo_features, algo_featurest, cv=folds).mean()
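
For context, my understanding is that cross_val_score fits the model on each training fold, scores it on the matching test fold (accuracy by default for classifiers), and .mean() then takes the unweighted mean of those per-fold scores. A rough sketch of that, assuming folds is a KFold/StratifiedKFold-style splitter and algo_features / algo_featurest are NumPy arrays (variable names below are just illustrative):

import numpy as np
from sklearn.base import clone

fold_scores = []
for train_idx, test_idx in folds.split(algo_features, algo_featurest):
    fold_model = clone(clas_model)
    fold_model.fit(algo_features[train_idx], algo_featurest[train_idx])
    # one accuracy value per fold
    fold_scores.append(fold_model.score(algo_features[test_idx], algo_featurest[test_idx]))

# unweighted mean of the per-fold accuracies
testing_score_manual = np.mean(fold_scores)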

To calculate the tp, fp, tn and fn counts I have the following code, which computes these metrics over all folds combined (the sum across folds, I suppose).

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

test_clas_predictions = cross_val_predict(clas_model, algo_features, algo_featurest, cv=folds)
test_cm = confusion_matrix(algo_featurest, test_clas_predictions)
test_tp = test_cm[1][1]
test_fp = test_cm[0][1]
test_tn = test_cm[0][0]
test_fn = test_cm[1][0]
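
As a side note, for binary labels the same four counts can be pulled out in one line with ravel(), since confusion_matrix returns [[tn, fp], [fn, tp]]:

test_tn, test_fp, test_fn, test_tp = confusion_matrix(algo_featurest, test_clas_predictions).ravel()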

The outcome of this code is:

                         algo      test  test_tp  test_fp  test_tn  test_fn
5                  GaussianNB  0.719762       25       13      190       71
4          LogisticRegression  0.716429       24       13      190       72
2      DecisionTreeClassifier  0.702381       38       33      170       58
0  GradientBoostingClassifier  0.682619       37       36      167       59
3        KNeighborsClassifier  0.679048       36       36      167       60
1      RandomForestClassifier  0.675952       40       43      160       56

So, taking the first row, cross_val_score.mean() gave 0.719762 (the test column), while computing the accuracy from the confusion matrix counts, (tp+tn)/(tp+tn+fp+fn) = (25+190)/(25+13+190+71) = 0.719063545150..., gives a slightly different value.
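
For what it's worth, the pooled figure should be reproducible directly from the cross_val_predict output with accuracy_score, which would show that this second number is a single accuracy over all out-of-fold predictions rather than a per-fold average:

from sklearn.metrics import accuracy_score

# accuracy over all pooled out-of-fold predictions; should match
# (tp+tn)/(tp+tn+fp+fn) = 0.719063... from the confusion matrix above
pooled_accuracy = accuracy_score(algo_featurest, test_clas_predictions)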

I came across the following explanation in a Quora article: "In cross_val_predict() elements are grouped slightly different than in cross_val_score(). It means that when you will calculate the same metric using these functions, you can get different results."

Is there any particular reason behind this?
