Sunday, October 7, 2018

After clustering data for training/validation/testing in ML, should I use all samples, or subsample to avoid bias?

In order to make sure that there is no information leakage between the training/validation/testing sets, I cluster my dataset.

As an example, imagine that I want to train an algorithm to detect whether someone is frowning or smiling. Since the same person may appear in many photos (photographed at different angles, with different backgrounds, etc.), I would put all the photos belonging to the same person into one cluster. I would then assign whole clusters to the training/validation/testing sets, as in the sketch below.
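A minimal sketch of such a group-aware split using scikit-learn's GroupShuffleSplit; the person_id array and the random data here are hypothetical placeholders, not part of the original question:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # placeholder feature matrix
y = rng.integers(0, 2, size=100)           # placeholder labels (frown/smile)
person_id = rng.integers(0, 20, size=100)  # cluster label: one id per person

# First carve off a test set whose clusters never appear in train/validation...
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
trainval_idx, test_idx = next(gss.split(X, y, groups=person_id))

# ...then split the remainder into train and validation the same way.
gss2 = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=1)
train_sub, val_sub = next(gss2.split(X[trainval_idx], y[trainval_idx],
                                     groups=person_id[trainval_idx]))
train_idx, val_idx = trainval_idx[train_sub], trainval_idx[val_sub]

# No person_id is shared between any two of the three sets.
assert not (set(person_id[train_idx]) & set(person_id[test_idx]))
assert not (set(person_id[val_idx]) & set(person_id[test_idx]))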

However, I've noticed that many people filter their data down to a non-redundant dataset: after clustering, they keep only one sample (or a fixed number of samples) from each cluster, perhaps the same number of positive and negative samples per cluster (see the sketch after this paragraph). Is this necessary in all cases (for all ML algorithms), or can I keep all my data in some situations? And what should I do with clusters that contain only positive or only negative datapoints (keep them, reject them, or maybe put them in the test set)?
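A hedged sketch of that per-cluster downsampling with pandas; the DataFrame, its cluster and label columns, and the values are hypothetical stand-ins for illustration only:

import pandas as pd

df = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 2],
    "label":   [1, 1, 0, 0, 0, 1],
    "feat":    [0.1, 0.3, 0.2, 0.9, 0.8, 0.5],
})

# Keep at most one sample per (cluster, label) pair; a cluster that contains
# only positives or only negatives simply contributes one sample of that class.
dedup = df.groupby(["cluster", "label"]).sample(n=1, random_state=0)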

Just a final note: rather than image recognition, I am actually working on a chemistry problem, and I am thinking of training random forests and gradient-boosted decision trees on a number of features. Thanks.
