Saturday, March 17, 2018

Training Set, Validation (Test) Set

Dear Stack Overflow members,

I would like to evaluate whether an SVC model is useful for predicting class membership (2 classes) from a large set of features. My dataset is high-dimensional (>600 features) with few samples (50) and 2 classes (true, false). For dimensionality reduction, LDA produced better results than PCA.
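For concreteness, here is a minimal sketch (not from the original post) of how LDA and PCA behave as dimensionality-reduction steps, assuming scikit-learn and synthetic data of the shape described above. One detail worth noting: with 2 classes, LDA can produce at most one component.

```python
# A minimal sketch (hypothetical data, not the poster's) comparing LDA and
# PCA as dimensionality reduction on 50 samples, 600+ features, 2 classes.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=50, n_features=620, n_informative=10,
                           n_classes=2, random_state=0)

# LDA is supervised and yields at most n_classes - 1 = 1 component here.
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

# PCA is unsupervised; its component count is capped by n_samples here.
X_pca = PCA(n_components=10).fit_transform(X)

print(X_lda.shape, X_pca.shape)  # (50, 1) (50, 10)
```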

I'm planning to adopt the following approach, but I'm unsure whether it makes sense and is statistically robust (a sketch of these steps follows below):

1. Split the whole dataset into a training set and a test set.
2. Fit LDA (to reduce dimensionality) on the training set and the test set independently.
3. Fit an SVC model on the training set, using k-fold cross-validation.
4. Evaluate the fitted SVC model on the test set.
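Here is a sketch of those four steps, again assuming scikit-learn and hypothetical synthetic data; the independent LDA fit on the test set in step 2 is exactly the part questioned next.

```python
# A sketch of the planned approach; variable names are hypothetical.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=50, n_features=620, n_informative=10,
                           n_classes=2, random_state=0)

# Step 1: split the whole dataset into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Step 2: fit LDA on the training set and the test set independently
# (this is the step whose validity the question is about).
X_train_lda = LinearDiscriminantAnalysis().fit_transform(X_train, y_train)
X_test_lda = LinearDiscriminantAnalysis().fit_transform(X_test, y_test)

# Step 3: fit an SVC on the reduced training set with k-fold CV.
clf = SVC(kernel="linear")
cv_scores = cross_val_score(clf, X_train_lda, y_train, cv=5)

# Step 4: evaluate the fitted SVC on the reduced test set.
clf.fit(X_train_lda, y_train)
test_score = clf.score(X_test_lda, y_test)
print(cv_scores.mean(), test_score)
```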

My question is whether this approach is biased and might fail to give accurate predictions on future unseen data. Specifically, do I risk bias by fitting LDA independently on the training and test sets as a preprocessing step (since LDA requires class-membership information)? Or should I run the LDA preprocessing on the whole dataset instead?

Thanks in advance.
