Friday, 24 February 2017

Testing for differences in bootstrapped AUC values between classifiers

I am training five classification models for a binary classification problem, and I would like to evaluate their performance using the classical AUC as my metric. I think that basing a decision on a single AUC value per classifier would be insufficient, so I am creating bootstrap samples from the training data set, drawn with replacement and of the same size as the original training set. For 50 bootstrap samples this gives me 5 × 50 AUC values, which I can plot as a boxplot (a sketch of this setup is given after the list below). Aside from assessing the differences through visualization, I would also like to test the differences in AUC values between models with a statistical test. I have read about the following tests for comparing AUC values:

  • Wilcoxon test
  • Friedman's test
  • DeLong's test (DeLong et al., 1988)
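For concreteness, here is a minimal sketch in R of this kind of bootstrap setup. It assumes a training data frame train with a binary outcome y, a hypothetical named list models of functions that refit each classifier, and the pROC package for the AUC; evaluating on the in-bag observations is just one possible choice (out-of-bag rows could be used instead):

    library(pROC)

    set.seed(123)
    n_boot  <- 50
    auc_mat <- matrix(NA, nrow = n_boot, ncol = length(models),
                      dimnames = list(NULL, names(models)))

    for (b in seq_len(n_boot)) {
      idx  <- sample(nrow(train), replace = TRUE)   # bootstrap sample, same size as the training set
      boot <- train[idx, ]
      for (m in names(models)) {
        fit  <- models[[m]](boot)                                 # hypothetical: refit classifier m on the bootstrap sample
        prob <- predict(fit, newdata = boot, type = "response")   # in-bag predicted probabilities
        auc_mat[b, m] <- as.numeric(auc(boot$y, prob))            # pROC::auc
      }
    }

    boxplot(auc_mat, ylab = "AUC")   # one box of 50 AUC values per classifier

The matrix auc_mat (one row per bootstrap sample, one column per classifier) is what the tests below would be applied to.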

As far as I understand, I can use the first to detect a significant difference in AUC between two methods: the per-bootstrap differences in AUC between the two models are taken and ranked, and the null hypothesis of no difference is tested against the alternative that one model's AUC is systematically larger than the other's. I read the example in the following link: http://ift.tt/1U67mvg. The implementation can be done with the standard R function wilcox.test.
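A minimal sketch of the paired signed-rank test on the bootstrap AUC values, assuming the auc_mat matrix from the sketch above and hypothetical column names model1 and model2:

    # paired signed-rank test on the per-bootstrap AUC differences
    wilcox.test(auc_mat[, "model1"], auc_mat[, "model2"],
                paired = TRUE, alternative = "two.sided")

The paired argument makes the test operate on the differences within each bootstrap sample; a one-sided alternative could be used if one model is expected to dominate.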

I have also found some papers comparing AUCs using Friedman's test, here and here. The second paper seems to use the test to check whether there is any difference within a whole set of K classifiers, possibly more than two.
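A minimal sketch, again assuming auc_mat: friedman.test accepts a matrix with one row per block (here, one bootstrap sample) and one column per treatment (here, one classifier):

    # H0: no difference among the K classifiers across the bootstrap samples
    friedman.test(auc_mat)

A significant result would then typically be followed by pairwise post-hoc comparisons.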

As far as I understand, the third compares the AUCs of two correlated ROC curves estimated on the same data (it is also commonly used to obtain confidence intervals for the AUC), and is not really a way of comparing sequences of AUCs across different bootstrap samples.
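For reference, a minimal sketch of DeLong's test with the pROC package, assuming hypothetical vectors prob1 and prob2 of predicted probabilities from two classifiers for the same observations with labels y_test:

    library(pROC)
    roc1 <- roc(y_test, prob1)
    roc2 <- roc(y_test, prob2)
    roc.test(roc1, roc2, method = "delong", paired = TRUE)   # compares the two correlated AUCs directly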

My question is: which tests would be appropriate in this case, and are there any no-gos?
