Dear TensorFlow Community,
My model trains fine on the GPU with a dataset containing 25 hours of audio. However, with a 200-hour audio dataset, training fails with the following error:
E tensorflow/compiler/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:306] gpu_async_0 cuMemAllocAsync failed to allocate 957465616 bytes: CUDA error: out of memory (CUDA_ERROR_OUT_OF_MEMORY)
I am currently using TensorFlow 2.14.0, Python 3.10, CUDA 11.8, cuDNN 8.7.0.84, NVIDIA driver 535.129.03, and Ubuntu 22.04.3.
nvidia-smi output:

GPU Name | Persistence-M | Bus-Id | Disp.A | Volatile Uncorr. ECC |
---|---|---|---|---|
NVIDIA GeForce RTX 4090 | Off | 00000000:01:00.0 | On | Off |

Fan | Temp | Perf | Power (Usage / Cap) | GPU Memory-Usage | GPU-Util | Compute M. |
---|---|---|---|---|---|---|
42% | 68°C | P2 | 328W / 450W | 16491MiB / 24564MiB | 100% | Default |

Processes:

GPU | GI | CI | PID | Type | Process name | GPU Memory |
---|---|---|---|---|---|---|
0 | N/A | N/A | 2158 | G | /usr/lib/xorg/Xorg | 392MiB |
0 | N/A | N/A | 2333 | G | /usr/bin/gnome-shell | 62MiB |
0 | N/A | N/A | 3625 | G | /usr/lib/firefox/firefox | 165MiB |
0 | N/A | N/A | 5937 | G | SpareRendererForSitePerProcess | 112MiB |
0 | N/A | N/A | 18689 | G | /usr/bin/nvidia-settings | 0MiB |
0 | N/A | N/A | 20854 | G | gnome-control-center | 6MiB |
0 | N/A | N/A | 29875 | C | /usr/bin/python | 15718MiB |

I have tried decreasing the complexity of the model (from 60M to 20M parameters) and reducing the batch size. When I reduced the batch size from 48 to 16, the error still occurred, only a few minutes later in training. I suspect that something might be wrong with my tf.data.Dataset input pipeline. Here is a snippet of the dataset input pipeline:
batch_size = 32
num_epochs = 20
buffer_size = 1000

# Training dataset
train_dataset = tf.data.Dataset.from_tensor_slices(
    (list(dataframe_training["wav_file_name"]), list(dataframe_training["transcription"]))
)
train_dataset = (
    train_dataset.map(feature_extractor, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size)
    .padded_batch(batch_size)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)

# Validation dataset
validation_dataset = tf.data.Dataset.from_tensor_slices(
    (list(dataframe_validation["wav_file_name"]), list(dataframe_validation["transcription"]))
)
validation_dataset = (
    validation_dataset.map(feature_extractor, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size)
    .padded_batch(batch_size)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)
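
One detail of the pipeline above that may be worth checking: with default arguments, padded_batch pads every example to the longest example in its batch, so a single long recording can inflate a whole batch. The following is only a rough sketch of that arithmetic in plain Python. The frame counts, the 80-dimensional feature size, and the 4-byte float32 assumption are hypothetical and not taken from the actual data, and the helper name padded_batch_bytes is made up for illustration. It estimates the input feature tensor only; activations inside the model would typically grow with the padded length as well.

```python
def padded_batch_bytes(frame_counts, batch_size, feature_dim, bytes_per_value=4):
    """Estimate the size in bytes of each padded batch's feature tensor.

    Mirrors the default padding behaviour of padded_batch: every example
    in a batch is padded to the longest example in that batch.
    """
    sizes = []
    for start in range(0, len(frame_counts), batch_size):
        batch = frame_counts[start:start + batch_size]
        # Tensor shape after padding: (len(batch), max(batch), feature_dim)
        sizes.append(len(batch) * max(batch) * feature_dim * bytes_per_value)
    return sizes


# Hypothetical frame counts: three short clips and one very long one.
frame_counts = [500, 520, 480, 6000]
sizes = padded_batch_bytes(frame_counts, batch_size=2, feature_dim=80)
print(sizes)  # [332800, 3840000]
```

In this toy example the second batch is more than ten times larger than the first, purely because of one long clip. Running the same estimate over the real per-file frame counts of the 25-hour and 200-hour datasets could show whether the larger dataset contains outliers of this kind.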
Are ‘train_dataset’ and ‘validation_dataset’ not copied to the GPU one batch at a time?