Hello TensorFlow developers,
I encountered rather strange behavior of the tf.keras.preprocessing.image_dataset_from_directory
function and was wondering if you could clarify things for me. The model I'm working with is based on this example.
In my code, I load the data like so:
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "MyDataset",
    validation_split=0.2,
    subset="training",
    seed=1337,
    image_size=image_size,
    batch_size=batch_size,
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "MyDataset",
    validation_split=0.2,
    subset="validation",
    seed=1337,
    image_size=image_size,
    batch_size=batch_size,
)
I run the data loading cell above only once.
I then go on to train my model while saving the weights at each epoch. After training, I use the model’s training history to pick the weights which achieved the highest validation accuracy.
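For context, the training cell is based on the linked example and looks roughly like this (a trimmed sketch; the model definition and the exact number of epochs are omitted):

callbacks = [
    # save the weights after every epoch, producing files like "save_at_1246.h5"
    keras.callbacks.ModelCheckpoint("save_at_{epoch}.h5"),
]
history = model.fit(
    train_ds,
    epochs=epochs,
    callbacks=callbacks,
    validation_data=val_ds,
)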
Here's where the strange behavior occurs: I want to load the best model weights and compute the classification metrics (accuracy, F1, etc.) of the loaded model. Below I'm copy/pasting the relevant Jupyter Notebook cells and their outputs:
model = keras.models.load_model('save_at_1246.h5')

from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

images_list = []
labels_list = []
for images, labels in val_ds.take(1):
    images_list.append(images)
    labels_list.append(labels.numpy())

predictions = np.argmax(model.predict(images_list), axis=-1)
predictions = predictions.reshape(1, -1)[0]

print("labels_list:")
print(labels_list)
print("numpy unique counts:")
print(np.unique(labels_list, return_counts=True))
print(classification_report(labels_list[0], predictions,
                            target_names=['1 (Class 0)', '2 (Class 1)', '3 (Class 2)', '4 (Class 3)']))
labels_list:
[array([0, 3, 1, 1, 2, 3, 2, 2, 1, 0, 3, 1, 2, 0, 2, 1, 3, 2, 1, 1, 1, 1,
1, 1, 3, 3, 2, 1, 2, 3, 1, 1])]
numpy unique counts:
(array([0, 1, 2, 3]), array([ 3, 14, 8, 7], dtype=int64))
              precision    recall  f1-score   support

 1 (Class 0)       1.00      0.67      0.80         3
 2 (Class 1)       0.82      1.00      0.90        14
 3 (Class 2)       0.71      0.62      0.67         8
 4 (Class 3)       0.67      0.57      0.62         7

    accuracy                           0.78        32
   macro avg       0.80      0.72      0.75        32
weighted avg       0.78      0.78      0.77        32
However, when I run the same cell again (copy/pasted below together with its output), I get:
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

images_list = []
labels_list = []
for images, labels in val_ds.take(1):
    images_list.append(images)
    labels_list.append(labels.numpy())

predictions = np.argmax(model.predict(images_list), axis=-1)
predictions = predictions.reshape(1, -1)[0]

print("labels_list:")
print(labels_list)
print("numpy unique counts:")
print(np.unique(labels_list, return_counts=True))
print(classification_report(labels_list[0], predictions,
                            target_names=['1 (Class 0)', '2 (Class 1)', '3 (Class 2)', '4 (Class 3)']))
labels_list:
[array([3, 2, 3, 1, 2, 1, 1, 3, 2, 1, 2, 1, 1, 3, 3, 2, 2, 1, 3, 2, 0, 1,
2, 2, 2, 3, 1, 3, 0, 1, 2, 1])]
numpy unique counts:
(array([0, 1, 2, 3]), array([ 2, 11, 11, 8], dtype=int64))
              precision    recall  f1-score   support

 1 (Class 0)       0.67      1.00      0.80         2
 2 (Class 1)       0.83      0.91      0.87        11
 3 (Class 2)       0.67      0.73      0.70        11
 4 (Class 3)       0.60      0.38      0.46         8

    accuracy                           0.72        32
   macro avg       0.69      0.75      0.71        32
weighted avg       0.71      0.72      0.70        32
Notice the discrepancy in the numpy unique counts between the two runs: the first run has [3, 14, 8, 7] as the label distribution, while the second has [2, 11, 11, 8]. I did not expect this behavior. I did expect the samples in val_ds to be shuffled (since I didn't pass shuffle=False when creating the dataset), but what bothers me is that the label counts are not even the same when I re-run the cell. Mind you, I ran the cell that creates val_ds only once.
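To tell apart "the batches are just reshuffled" from "val_ds actually contains different samples", I suppose I could count the labels over the entire val_ds rather than just the first batch, something like:

all_labels = np.concatenate([labels.numpy() for _, labels in val_ds])
print(np.unique(all_labels, return_counts=True))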
I have two questions on this:
- Why is this happening, and is there a way to get my desired behavior, i.e., to get the same data samples (albeit perhaps not in the same order) every time I iterate over a dataset created with tf.keras.preprocessing.image_dataset_from_directory? One thing I considered is passing shuffle=False when creating val_ds (see the snippet below these questions), but I'm not sure that's the intended solution.
- If tf.keras.preprocessing.image_dataset_from_directory works the way I described here, does that mean there is an overlap between the training and validation datasets during training with model.fit()?
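For reference, by the shuffle=False variant in my first question I mean something like this (I'm not sure it's the right fix, hence the question):

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "MyDataset",
    validation_split=0.2,
    subset="validation",
    seed=1337,
    shuffle=False,  # my assumption: keep the sample order fixed across iterations
    image_size=image_size,
    batch_size=batch_size,
)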
Thank you in advance!