Error While training: Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence

Two issues; I am not sure whether they are related.
The first issue is that the accuracy I compute from the confusion matrix (46.5%) differs from the accuracy reported by model.fit (~89%).
Here are all the code segments that use the training data:

train_ds = tf.data.Dataset.from_generator(FrameGenerator(subset_paths['train'], n_frames, training=True),
                                          output_signature = output_signature)
# Batch the data
train_ds = train_ds.batch(batch_size)
frames, label = next(iter(train_ds))

history = model.fit(x = train_ds,
                    epochs = 50, 
                    validation_data = val_ds)

The second issue is that I am also seeing errors during training. I have tried changing the batch size, without success.

Epoch 1/50
     17/Unknown 47s 2s/step - accuracy: 0.3868 - loss: 1.7796
2024-08-07 16:13:27.110801: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node IteratorGetNext}}]]
/Users/maycaj/anaconda3/lib/python3.11/contextlib.py:155: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
  self.gen.throw(typ, value, traceback)
2024-08-07 16:13:30.584240: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node IteratorGetNext}}]]
17/17 ━━━━━━━━━━━━━━━━━━━━ 51s 3s/step - accuracy: 0.3907 - loss: 1.7470 - val_accuracy: 0.6000 - val_loss: 0.6804
Epoch 2/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 3s/step - accuracy: 0.4261 - loss: 0.7089
2024-08-07 16:14:17.404221: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node IteratorGetNext}}]]
17/17 ━━━━━━━━━━━━━━━━━━━━ 47s 3s/step - accuracy: 0.4296 - loss: 0.7084 - val_accuracy: 0.6000 - val_loss: 0.6733
Epoch 3/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.5752 - loss: 0.6670 - val_accuracy: 0.3778 - val_loss: 0.7342
Epoch 4/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 2s/step - accuracy: 0.4989 - loss: 0.7322
2024-08-07 16:15:48.459924: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node IteratorGetNext}}]]
17/17 ━━━━━━━━━━━━━━━━━━━━ 45s 3s/step - accuracy: 0.5005 - loss: 0.7307 - val_accuracy: 0.6000 - val_loss: 0.6651
Epoch 5/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 45s 3s/step - accuracy: 0.5763 - loss: 0.6714 - val_accuracy: 0.6000 - val_loss: 0.7116
Epoch 6/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 45s 3s/step - accuracy: 0.6061 - loss: 0.6694 - val_accuracy: 0.6000 - val_loss: 0.9239
Epoch 7/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.5772 - loss: 0.7055 - val_accuracy: 0.6000 - val_loss: 0.8254
Epoch 8/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 2s/step - accuracy: 0.4682 - loss: 0.8401
2024-08-07 16:18:50.161854: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node IteratorGetNext}}]]
17/17 ━━━━━━━━━━━━━━━━━━━━ 45s 3s/step - accuracy: 0.4659 - loss: 0.8379 - val_accuracy: 0.6000 - val_loss: 0.7225
Epoch 9/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 45s 3s/step - accuracy: 0.6680 - loss: 0.6395 - val_accuracy: 0.6000 - val_loss: 0.6922
Epoch 10/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.5701 - loss: 0.6770 - val_accuracy: 0.6000 - val_loss: 0.6552
Epoch 11/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.6319 - loss: 0.6595 - val_accuracy: 0.6000 - val_loss: 0.9322
Epoch 12/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.5416 - loss: 0.6640 - val_accuracy: 0.6000 - val_loss: 0.6799
Epoch 13/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.6133 - loss: 0.6387 - val_accuracy: 0.6000 - val_loss: 0.8982
Epoch 14/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.4841 - loss: 0.7464 - val_accuracy: 0.6000 - val_loss: 1.0706
Epoch 15/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.5673 - loss: 0.7374 - val_accuracy: 0.6000 - val_loss: 0.7965
Epoch 16/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 2s/step - accuracy: 0.5752 - loss: 0.6462
2024-08-07 16:24:55.453286: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node IteratorGetNext}}]]
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.5785 - loss: 0.6447 - val_accuracy: 0.6000 - val_loss: 0.8740
Epoch 17/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.6149 - loss: 0.6220 - val_accuracy: 0.6000 - val_loss: 0.9858
Epoch 18/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.5753 - loss: 0.6361 - val_accuracy: 0.6000 - val_loss: 1.0021
Epoch 19/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.6128 - loss: 0.6452 - val_accuracy: 0.6000 - val_loss: 0.7140
Epoch 20/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.7083 - loss: 0.5956 - val_accuracy: 0.6000 - val_loss: 0.6552
Epoch 21/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.6937 - loss: 0.6198 - val_accuracy: 0.7333 - val_loss: 0.5785
Epoch 22/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.5830 - loss: 0.6744 - val_accuracy: 0.6000 - val_loss: 0.6864
Epoch 23/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.6677 - loss: 0.6101 - val_accuracy: 0.6000 - val_loss: 0.7047
Epoch 24/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.6521 - loss: 0.5952 - val_accuracy: 0.6000 - val_loss: 0.7624
Epoch 25/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 50s 3s/step - accuracy: 0.6431 - loss: 0.5888 - val_accuracy: 0.6889 - val_loss: 0.5465
Epoch 26/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 46s 3s/step - accuracy: 0.6694 - loss: 0.5774 - val_accuracy: 0.6000 - val_loss: 0.7985
Epoch 27/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 45s 3s/step - accuracy: 0.6879 - loss: 0.6017 - val_accuracy: 0.6667 - val_loss: 0.5805
Epoch 28/50
17/17 ━━━━━━━━━━━━━━━━━━━━ 45s 3s/step - accuracy: 0.8072 - loss: 0.5052 - val_accuracy: 0.6000 - val_loss: 0.9059
Epoch 29/50

Hi @cams, could you please let us know which dataset (test, validation, or train) you are using to calculate the accuracy from the confusion matrix? Also, is the model.fit accuracy you mention the training accuracy or the validation accuracy?

The warnings you see during training relate to the training dataset: they indicate that the dataset was exhausted before the expected number of steps per epoch was completed.

To avoid this warning, you can call the repeat method on the dataset, which produces an infinite dataset so that the generator keeps providing data after one full pass through it. A repeated dataset never ends on its own, so in that case also pass the steps_per_epoch argument to the fit method. You can calculate steps_per_epoch as the number of training examples divided by batch_size; note that len(dataset) raises an error for a dataset built with from_generator, because its cardinality is unknown. Thank you.
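As a rough sketch of the arithmetic involved, the number of steps can be derived from the number of training videos. The example count of 130 and the batch size of 8 are made-up numbers, not values from this thread, and the two helper functions are hypothetical names.

```python
import math

def steps_for_one_pass(num_examples: int, batch_size: int) -> int:
    """Batches produced by one pass when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)

def steps_for_full_batches(num_examples: int, batch_size: int) -> int:
    """Batches per epoch if only full batches are counted."""
    return num_examples // batch_size

# Hypothetical numbers, for illustration only.
num_examples = 130
batch_size = 8

print(steps_for_one_pass(num_examples, batch_size))      # 17
print(steps_for_full_batches(num_examples, batch_size))  # 16

# With a repeated dataset, fit could then be called roughly like this:
#   model.fit(train_ds.repeat(),
#             steps_per_epoch=steps_for_full_batches(num_examples, batch_size),
#             epochs=50, validation_data=val_ds)
```

The example count itself could be obtained by counting the video files that the generator would yield, since the dataset object cannot report its own length here.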

Hello,
I am following the https://www.tensorflow.org/tutorials/video/video_classification tutorial
I am using the UCF101 dataset:

URL = 'https://storage.googleapis.com/thumos14_files/UCF101_videos.zip'
download_dir = pathlib.Path('./UCF101_subset/')
subset_paths = download_ufc_101_subset(URL, 
                        num_classes = num_classes, # modified
                        splits = {"train": train_split, "val": 10, "test": 10}, # why does this download close to the number asked for, but not exactly?
                        download_dir = download_dir)

Both the model.fit accuracy and the confusion matrix are computed on the training data.
Model.fit:

history = model.fit(x = train_ds, 
                    epochs = 60, 
                    validation_data = val_ds)

def plot_history(history):
    """
    Plotting training and validation learning curves.

    Args:
        history: model history with all the metric measures
    """
    fig, (ax1, ax2) = plt.subplots(2)

    fig.set_size_inches(18.5, 10.5)

    # Plot loss
    ax1.set_title('Loss')
    ax1.plot(history.history['loss'], label = 'train')
    ax1.plot(history.history['val_loss'], label = 'test')
    ax1.set_ylabel('Loss')

    # Determine upper bound of y-axis
    max_loss = max(history.history['loss'] + history.history['val_loss'])

    ax1.set_ylim([0, np.ceil(max_loss)])
    ax1.set_xlabel('Epoch')
    ax1.legend(['Train', 'Validation']) 

    # Plot accuracy
    ax2.set_title('Accuracy')
    ax2.plot(history.history['accuracy'],  label = 'train')
    ax2.plot(history.history['val_accuracy'], label = 'test')
    ax2.set_ylabel('Accuracy')
    ax2.set_ylim([0, 1])
    ax2.set_xlabel('Epoch')
    ax2.legend(['Train', 'Validation'])

    plt.show()

plot_history(history)

def plot_confusion_matrix(actual, predicted, labels, ds_type):
    plt.figure()
    cm = tf.math.confusion_matrix(actual, predicted)
    ax = sns.heatmap(cm, annot=True, fmt='g')
    # sns.set(rc={'figure.figsize':(12, 12)})
    ax.set_title('Confusion matrix of action recognition for ' + ds_type)
    ax.set_xlabel('Predicted Action')
    ax.set_ylabel('Actual Action')
    plt.xticks(rotation=90)
    plt.yticks(rotation=0)
    ax.xaxis.set_ticklabels(labels)
    ax.yaxis.set_ticklabels(labels)
    plt.show()

fg = FrameGenerator(subset_paths['train'], n_frames, training=True)
labels = list(fg.class_ids_for_name.keys())

actual, predicted = get_actual_predicted_labels(train_ds)
plot_confusion_matrix(actual, predicted, labels, 'training')
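For reference, the accuracy implied by a confusion matrix is the sum of its diagonal divided by the sum of all entries. The sketch below shows that calculation on a made-up 2x2 matrix; the function name and the numbers are illustrative and do not come from my run.

```python
def accuracy_from_confusion_matrix(cm):
    """Fraction of correct predictions: diagonal sum over total count."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Made-up 2x2 matrix: rows are actual classes, columns are predicted classes.
example_cm = [[20, 5],
              [10, 15]]

print(accuracy_from_confusion_matrix(example_cm))  # 0.7
```

The function expects a nested list, so a tensor returned by tf.math.confusion_matrix would presumably need converting first, for example with cm.numpy().tolist().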

Hi @cams, if you look at the code for getting the actual and predicted values on the train dataset:

def get_actual_predicted_labels(dataset): 

  actual = [labels for _, labels in dataset.unbatch()]
  predicted = model.predict(dataset)

  actual = tf.stack(actual, axis=0)
  predicted = tf.concat(predicted, axis=0)
  predicted = tf.argmax(predicted, axis=1)

  return actual, predicted

The model predictions are taken on the batched dataset, while the actual labels come from the unbatched dataset; that might be the reason for the difference between the model.fit accuracy and the confusion matrix accuracy. Could you please try making predictions on the unbatched dataset and then calculating the accuracy from the actual and predicted labels? Thank you.
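One way to rule out any mismatch between the two lists is to collect labels and predictions in the same pass over the data. The sketch below is only an illustration of that idea: collect_labels_and_predictions, toy_batches and toy_predict are hypothetical names, and the toy data stands in for the video dataset and the model.

```python
def collect_labels_and_predictions(batches, predict_fn):
    """Walk the batches once, keeping each label next to its prediction."""
    actual, predicted = [], []
    for inputs, labels in batches:
        scores = predict_fn(inputs)  # one row of class scores per example
        for label, row in zip(labels, scores):
            actual.append(int(label))
            # argmax over the class scores of this example
            predicted.append(max(range(len(row)), key=lambda k: row[k]))
    return actual, predicted

# Toy stand-ins (hypothetical): inputs are integers, the label is the parity.
def toy_batches():
    yield [1, 2, 3], [1, 0, 1]
    yield [4, 5], [0, 1]

def toy_predict(inputs):
    # Scores for classes 0 and 1; the higher score marks the parity.
    return [[1.0, 0.0] if x % 2 == 0 else [0.0, 1.0] for x in inputs]

actual, predicted = collect_labels_and_predictions(toy_batches(), toy_predict)
print(actual)     # [1, 0, 1, 0, 1]
print(predicted)  # [1, 0, 1, 0, 1]
```

With the real model, predict_fn might be something like model.predict_on_batch applied to each batch of the dataset, but that has not been tested against the code in this thread.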

Hello,
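I determined that the issue was due to training=True.

When training=True, the pairs are shuffled using random.shuffle(pairs) every time the generator is called. get_actual_predicted_labels goes through the dataset twice (once to collect the labels and once in model.predict), so the two passes see the videos in different orders. The labels then no longer line up with the predictions, which breaks the confusion matrix. The original code: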

class FrameGenerator:
  def __init__(self, path, n_frames, training = False):
    """ Returns a set of frames with their associated label. 

      Args:
        path: Video file paths.
        n_frames: Number of frames. 
        training: Boolean to determine if training dataset is being created.
    """
    self.path = path
    self.n_frames = n_frames
    self.training = training
    self.class_names = sorted(set(p.name for p in self.path.iterdir() if p.is_dir()))
    self.class_ids_for_name = dict((name, idx) for idx, name in enumerate(self.class_names))

  def get_files_and_class_names(self):
    video_paths = list(self.path.glob('*/*.avi'))
    classes = [p.parent.name for p in video_paths] 
    return video_paths, classes

  def __call__(self):
    video_paths, classes = self.get_files_and_class_names()

    pairs = list(zip(video_paths, classes))

    if self.training:
      random.shuffle(pairs)

    for path, name in pairs:
      video_frames = frames_from_video_file(path, self.n_frames) 
      label = self.class_ids_for_name[name] # Encode labels
      yield video_frames, label
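The effect of the per-call shuffle can be reproduced without TensorFlow. The sketch below uses a hypothetical stand-in class (ShufflingGenerator, not the real FrameGenerator) and a toy "model" that is always right, then reads labels in one pass and predictions in a second pass, as get_actual_predicted_labels does.

```python
import random

class ShufflingGenerator:
    """Hypothetical stand-in for FrameGenerator: yields (item, label) pairs."""
    def __init__(self, pairs, training=False, seed=None):
        self.pairs = list(pairs)
        self.training = training
        self.rng = random.Random(seed)

    def __call__(self):
        pairs = list(self.pairs)
        if self.training:
            self.rng.shuffle(pairs)  # a new order on every call
        for item, label in pairs:
            yield item, label

def perfect_model(item):
    # Toy "model" that always predicts the true label (the parity of the item).
    return item % 2

def measured_accuracy(gen):
    actual = [label for _, label in gen()]                  # first pass
    predicted = [perfect_model(item) for item, _ in gen()]  # second pass
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

pairs = [(i, i % 2) for i in range(100)]

acc_ordered = measured_accuracy(ShufflingGenerator(pairs, training=False))
acc_shuffled = measured_accuracy(ShufflingGenerator(pairs, training=True, seed=0))

print(acc_ordered)   # 1.0
print(acc_shuffled)  # lower than 1.0 even though the model never errs
```

Even a perfect model scores poorly here because the labels and predictions come from differently ordered passes, which matches the drop I saw in the confusion matrix.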

Modified code that works:
(training=False)

train_ds_not_shuffled = tf.data.Dataset.from_generator(FrameGenerator(subset_paths['train'], n_frames, training=False), # modified
                                          output_signature = output_signature)

train_ds_not_shuffled = train_ds_not_shuffled.batch(batch_size) # modified

actual, predicted = get_actual_predicted_labels(train_ds_not_shuffled)
plot_confusion_matrix(actual, predicted, labels, 'training')