How to fit a large dataset to a model?

When I have a small dataset I fit my model like this:

model.fit(
    train,
    train_labels,
    epochs=200,
    validation_split=0.2,
    batch_size=100,
    callbacks=[es],
    use_multiprocessing=True
)

but now I can't load the whole train set at once as it's too large, and I wonder how I can fit this train set to the model part by part?
(if I can only load train_part_1, train_part_2, train_part_3 from disk separately)

Hi Ashley,

What you are trying to do is to use batch_size properly.

If you build your data pipeline with tf.data.Dataset (tf.data.Dataset  |  TensorFlow v2.16.1), it will load the data from disk for you and feed it to the model in chunks that fit in memory. Of course, the size of these chunks is up to you to define.

This is a great tutorial to give more insights: Better performance with the tf.data API  |  TensorFlow Core
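
Roughly, the idea looks like this (just a sketch: file_paths, labels and load_example are placeholders for your own file list, label array and loading function):

import tensorflow as tf

# Sketch of a streaming input pipeline; nothing is held in memory
# beyond the batches that are currently being prepared.
dataset = tf.data.Dataset.from_tensor_slices((file_paths, labels))        # placeholders
dataset = dataset.map(load_example, num_parallel_calls=tf.data.AUTOTUNE)  # your own loading function
dataset = dataset.batch(100)                  # the chunk size that has to fit in memory
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # overlap loading with training

model.fit(dataset, epochs=200, callbacks=[es])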

thanks for your reply!! I tried several ways to load my data with tf.data.Dataset but no luck 😿
I have all my resized images saved as .npy files and I was trying this:

import glob

import numpy as np
import tensorflow as tf

def map_func(feature_path):
  # feature_path arrives as a bytes object when called via tf.numpy_function
  feature = np.load(feature_path)
  return feature

feature_paths = glob.glob('./*.np[yz]')

dataset = tf.data.Dataset.from_tensor_slices(feature_paths)

# Use map to load the numpy files in parallel
dataset = dataset.map(
    lambda item: tf.numpy_function(map_func, [item], tf.float16),
    num_parallel_calls=tf.data.AUTOTUNE)

print(dataset)

but I can't really understand how I would fit such a dataset to the model. I have a label for each .npy file in a separate array, but as I understand it the labels should be included in the dataset somehow(?), because when I try to add them the usual way it throws an error: ValueError: y argument is not supported when using dataset as input.

and without labels I’ve got ValueError: No gradients provided for any variable: ['dense/kernel:0', 'dense/bias:0', 'dense_1/kernel:0', 'dense_1/bias:0', 'dense_2/kernel:0', 'dense_2/bias:0'].

Could you please advise on how to add labels to the dataset properly?
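
For reference, this is roughly how I imagined pairing them (changing map_func to also take the label; labels here stands for my label array, in the same order as feature_paths, and I'm assuming integer labels), but I'm not sure it's the right approach:

def map_func(feature_path, label):
  # both arguments arrive as numpy values inside tf.numpy_function
  feature = np.load(feature_path)
  return feature, label

dataset = tf.data.Dataset.from_tensor_slices((feature_paths, labels))
dataset = dataset.map(
    lambda path, label: tf.numpy_function(
        map_func, [path, label], [tf.float16, tf.int64]),
    num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(100)

model.fit(dataset, epochs=200, callbacks=[es])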

What I'd do is look at how a simple example does that, like this one: Load and preprocess images  |  TensorFlow Core

maybe structure the folders so that files from a specific class go into a directory with that class name, like the flowers dataset does, and then use the same strategy
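
Something along these lines, roughly following that tutorial (just a sketch: the directory name, image size and seed are placeholders, and it assumes the raw image files are grouped into one folder per class):

import tensorflow as tf

# sketch, assuming a layout like images/class_a/*.jpg, images/class_b/*.jpg, ...
train_ds = tf.keras.utils.image_dataset_from_directory(
    'images/',
    validation_split=0.2,
    subset='training',
    seed=123,
    image_size=(180, 180),
    batch_size=100)

val_ds = tf.keras.utils.image_dataset_from_directory(
    'images/',
    validation_split=0.2,
    subset='validation',
    seed=123,
    image_size=(180, 180),
    batch_size=100)

model.fit(train_ds, validation_data=val_ds, epochs=200, callbacks=[es])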

I tried this (though I couldn't find a way to load my .zip file of images from local disk, so I had to upload it to Google Drive to use the get_file function), but it only allowed me to download the archive, not to unzip and load it (extract=True doesn't work in my case)

can you try uploading just a portion of the raw data (not zipped), just to test the pipeline?

yes, thanks! It started to work after moving the files into per-class directories

perfect, glad that it worked and glad to be of help
