Monday, 29 June 2020

Convolutional neural networks: prevent cheating in the test phase

I am building a convolutional neural network for image classification. Images are stored in two distinct folders:

  1. training images (8000 files)
  2. test images (2000 files)

Clearly, I need a training set, a validation set and a test set. The first solution is splitting the images in the first folder into a training group and a validation group while using the images in the second folder for testing purposes:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# 20% of the images in training_path are held out as the validation subset
train_datagen        = ImageDataGenerator(rescale = 1./255, validation_split = 0.2)
test_datagen         = ImageDataGenerator(rescale = 1./255)
train_generator      = train_datagen.flow_from_directory(training_path, target_size = (150,150), batch_size = 20, subset = "training", class_mode = "binary")
validation_generator = train_datagen.flow_from_directory(training_path, target_size = (150,150), batch_size = 20, subset = "validation", class_mode = "binary")
test_generator       = test_datagen.flow_from_directory(test_path, target_size = (150,150), batch_size = 20, class_mode = "binary")
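
For reference, here is a rough sketch of the set sizes this split produces. It assumes that the split is applied per class folder and that my two classes are balanced (4000 images each); the helper name split_counts is only illustrative and not part of Keras.

```python
def split_counts(n_files, validation_split):
    """Return (n_training, n_validation) for one class folder."""
    n_validation = int(validation_split * n_files)
    return n_files - n_validation, n_validation

# Assumption: two balanced classes, 4000 images each (8000 in total)
per_class = [4000, 4000]
counts = [split_counts(n, 0.2) for n in per_class]

n_train = sum(c[0] for c in counts)
n_val = sum(c[1] for c in counts)
print(n_train, n_val)
```

With unbalanced classes the totals may differ slightly because of the rounding done per class.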

This approach becomes problematic when I want to augment my training images, because operations like stretching and zooming will also be applied to my validation set. One could simply split the first folder into two separate folders (e.g. 6000 images for training, 2000 images for validation) and use ImageDataGenerator() for each folder without the "validation_split" parameter. However, I am not allowed to modify the folder structure; that is, I cannot reorganize the images. My idea is to split the 2000 images in the second folder into a validation set and a test set:

train_datagen   = ImageDataGenerator(rescale            = 1./255,
                                     rotation_range     = rotation_range,
                                     width_shift_range  = width_shift_range,
                                     height_shift_range = height_shift_range,
                                     shear_range        = shear_range,
                                     brightness_range   = brightness_range,
                                     zoom_range         = zoom_range,
                                     horizontal_flip    = horizontal_flip,
                                     fill_mode          = fill_mode)

# No augmentation here; the images in test_path are split 50/50
test_datagen         = ImageDataGenerator(rescale = 1./255, validation_split = 0.5)
train_generator      = train_datagen.flow_from_directory(training_path, target_size = (150,150), batch_size = 20, class_mode = "binary")
# subset only accepts "training" or "validation", so the two halves of test_path are selected with those names
validation_generator = test_datagen.flow_from_directory(test_path, target_size = (150,150), batch_size = 20, subset = "training", class_mode = "binary")
test_generator       = test_datagen.flow_from_directory(test_path, target_size = (150,150), batch_size = 20, subset = "validation", class_mode = "binary")

so that 1000 of these images are used for validation and the other 1000 for testing. Am I cheating in the test phase by doing this?
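
One thing I can at least verify myself is that the two halves do not share any files. Below is a minimal sketch of such a check; the helper name overlap and the sample file names are hypothetical, and with the real generators I would pass their filenames attributes instead.

```python
def overlap(filenames_a, filenames_b):
    """Return the sorted list of file names present in both collections."""
    return sorted(set(filenames_a) & set(filenames_b))

# Hypothetical file names standing in for the generators' file lists
validation_files = ["cats/cat.0001.jpg", "dogs/dog.0001.jpg"]
test_files = ["cats/cat.0002.jpg", "dogs/dog.0002.jpg"]

shared = overlap(validation_files, test_files)
print("overlapping files:", shared)
```

For the setup above this would be overlap(validation_generator.filenames, test_generator.filenames). An empty list would mean that no test image was also used for validation.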

Thank you!
