Hi, I’m trying to build a very basic CNN for multiclass image classification and I’m getting stuck on one of the first steps: splitting the data. Following a YouTube tutorial, I initially created a dataset with tf.keras.utils.image_dataset_from_directory() and then split it into train/validation/test using .skip() and .take(). The model trained fine, but I noticed that the test set changed each time I iterated over it (even when it was only one batch). My understanding is that with this method, the whole dataset is reshuffled and the samples redrawn every time it is exhausted. So, question 1: is this a source of data leakage, in that the samples are redrawn at every epoch of training, so the model has effectively already seen the test set?
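To show what I mean, here’s a tiny toy example (just tf.data.Dataset.range, not my real data). It relies on my assumption that image_dataset_from_directory shuffles internally with .shuffle(), whose default reshuffle_each_iteration=True redraws the order on every pass:

import tensorflow as tf

# .shuffle() reshuffles on each iteration by default, so a .take() slice
# is drawn from a different ordering every time it's consumed
ds = tf.data.Dataset.range(10).shuffle(10, seed=0)
test = ds.take(3)
print(list(test.as_numpy_iterator()))  # three elements...
print(list(test.as_numpy_iterator()))  # ...usually a different three here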
As a result, I decided to create a separate directory for the test data that I never touch, and only do the train/validation split in code. From reading online, I realised I could do that split with the validation_split keyword. However, that brings up my second question: if I split using validation_split (Method 1 below), I only get validation accuracy up to about 0.5 during training, whereas with the .skip()/.take() method (Method 2) I can get up to 0.95. I’m clearly doing something different between the two methods but can’t see what. Could anyone explain what it is, and which method is better?
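For completeness, this is roughly how I load the held-out test set (test_dir is just my name for the separate folder; shuffle=False so the batches stay in a fixed order):

test_ds = tf.keras.utils.image_dataset_from_directory(
    test_dir, shuffle=False)
test_ds = test_ds.map(lambda x, y: (x / 255., y))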
## METHOD 1 ##
import tensorflow as tf  # shown for completeness; train_dir and RANDOM_STATE are defined earlier

validation_split = 0.2
train1 = tf.keras.utils.image_dataset_from_directory(
    train_dir,
    validation_split=validation_split,
    subset="training",
    seed=RANDOM_STATE)
val1 = tf.keras.utils.image_dataset_from_directory(
    train_dir,
    validation_split=validation_split,
    subset="validation",
    seed=RANDOM_STATE)  # same seed and split fraction in both calls
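# Sanity check I'd add here, before the .map() calls below -- I believe recent
# TF versions attach a file_paths attribute to the dataset returned by
# image_dataset_from_directory (it's lost once you .map()), so the two
# subsets can be tested for overlap:
print(len(set(train1.file_paths) & set(val1.file_paths)))  # expect 0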
# Scale the pixel data to between 0 and 1
train1 = train1.map(lambda x, y: (x / 255., y))
val1 = val1.map(lambda x, y: (x / 255., y))
## METHOD 2 ##
data = tf.keras.utils.image_dataset_from_directory(train_dir)
# Scale the pixel data to between 0 and 1
data = data.map(lambda x, y: (x / 255., y))
# Split into train and validation batches (the test set now lives in its own directory)
n_batches = len(data)
train_size = int(n_batches * 0.8)
val_size = int(n_batches * 0.2)
# If rounding leaves some batches unassigned, add the spares to the training set
total_size = train_size + val_size
if total_size < n_batches:
    train_size += n_batches - total_size
train2 = data.take(train_size)
val2 = data.skip(train_size).take(val_size)
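And this is the kind of quick check that shows the batches being redrawn (a sketch, comparing the labels from two passes over val2):

# Iterate the validation split twice and compare the label batches;
# if the underlying dataset reshuffles, the two passes won't match
labels_pass1 = [y.numpy().tolist() for _, y in val2]
labels_pass2 = [y.numpy().tolist() for _, y in val2]
print(labels_pass1 == labels_pass2)  # False whenever the split is being redrawn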
Thank you so much for any help!