Problems copying data to GPU with Keras and tf.data.Dataset

Hello,

I’m a relative newcomer to TF and especially to GPU computing. I’m trying to build an NLP model for text classification, but I’m struggling with GPU memory allocation on recent TensorFlow releases. I already had a model mostly working on TF 2.4.1, but on later versions it crashes because the GPU runs out of memory. I have narrowed it down to some minimal examples that I’m presenting here. It feels like it could be a regression in TF 2.5 or 2.6 that I should maybe report as a bug, but I want to make sure first that I’m not making some simple mistake.

My machine is an Asus laptop with 16GB RAM and an integrated GeForce MX150 GPU with 2GB VRAM. This is not a very powerful GPU, but nevertheless I managed to run a version of Stable Diffusion on it, and for neural computing the GPU appears to be significantly faster than the i7-8550U CPU. I’m using Ubuntu Linux 20.04 and Python 3.9.15 installed via miniconda. I’m trying different versions of the tensorflow-gpu package that I’ve installed in separate conda environments. I’ve installed NVIDIA driver version 515.76. The GPU is only used for computing, not for graphics.

The problem appears when using large NumPy arrays as training data. The whole array doesn’t fit into GPU VRAM at once, but my understanding is that only a single batch should need to go into VRAM at a time. Here is a simple but large data set (all zeros) and a toy Keras model with very few parameters:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# 4GB array
X = np.zeros((1024*1024, 1024), dtype=np.float32)
# 4MB array
Y = np.ones(1024*1024, dtype=np.float32)

# define a toy model (linear regression, 1025 parameters)
model = Sequential()
model.add(Dense(1, input_shape=(1024,), activation='linear'))
model.compile(loss='mean_squared_error')
model.summary()

model.fit(X, Y, batch_size=32)

I first tried running this under TF 2.4.1 which is available in the conda default repo. It runs just fine. Here is the output:

Keras model TF 2.4.1 output
2022-12-12 22:09:37.062352: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-12-12 22:09:37.898164: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-12-12 22:09:37.898926: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-12-12 22:09:37.930082: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:09:37.930474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce MX150 computeCapability: 6.1
coreClock: 1.5315GHz coreCount: 3 deviceMemorySize: 1.96GiB deviceMemoryBandwidth: 44.76GiB/s
2022-12-12 22:09:37.930511: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-12-12 22:09:37.932318: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2022-12-12 22:09:37.932411: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2022-12-12 22:09:37.934061: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-12-12 22:09:37.934470: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-12-12 22:09:37.936205: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-12-12 22:09:37.937265: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2022-12-12 22:09:37.940598: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2022-12-12 22:09:37.940717: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:09:37.941120: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:09:37.941426: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-12-12 22:09:37.941726: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-12 22:09:37.942046: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-12-12 22:09:37.942131: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:09:37.942437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce MX150 computeCapability: 6.1
coreClock: 1.5315GHz coreCount: 3 deviceMemorySize: 1.96GiB deviceMemoryBandwidth: 44.76GiB/s
2022-12-12 22:09:37.942456: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-12-12 22:09:37.942471: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2022-12-12 22:09:37.942481: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2022-12-12 22:09:37.942491: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-12-12 22:09:37.942500: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-12-12 22:09:37.942510: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-12-12 22:09:37.942519: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2022-12-12 22:09:37.942531: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2022-12-12 22:09:37.942574: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:09:37.942891: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:09:37.943178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-12-12 22:09:37.943206: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-12-12 22:09:38.390832: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-12-12 22:09:38.390870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2022-12-12 22:09:38.390876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2022-12-12 22:09:38.391072: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:09:38.391246: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:09:38.391383: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:09:38.391506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1632 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce MX150, pci bus id: 0000:01:00.0, compute capability: 6.1)
2022-12-12 22:09:38.431914: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 4294967296 exceeds 10% of free system memory.
2022-12-12 22:09:41.416372: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-12-12 22:09:41.433482: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 1999965000 Hz
2022-12-12 22:09:41.629588: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 1)                 1025      
=================================================================
Total params: 1,025
Trainable params: 1,025
Non-trainable params: 0
_________________________________________________________________
32768/32768 [==============================] - 32s 976us/step - loss: 0.0534
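
For reference, the numbers in this log can be reproduced with some quick arithmetic. The sketch below only restates the sizes from the script above (the variable names are made up for illustration) and shows how small a single batch is compared with the full array:

```python
# Sizes for the toy example: 1024*1024 rows of 1024 float32 values each.
BYTES_PER_FLOAT32 = 4
rows, cols, batch_size = 1024 * 1024, 1024, 32

full_array_bytes = rows * cols * BYTES_PER_FLOAT32   # whole X array
batch_bytes = batch_size * cols * BYTES_PER_FLOAT32  # inputs of one batch
steps_per_epoch = rows // batch_size                 # batches per epoch

print(full_array_bytes)  # 4294967296, the size in the cpu_allocator warning
print(batch_bytes)       # 131072, i.e. 128 KiB
print(steps_per_epoch)   # 32768, the step count in the progress bar
```

So a single batch of inputs is tiny compared with the 1632 MB that TF reports for the GPU device, while the full array is more than twice that size.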

TF 2.4 is quite old, so I wanted to try more recent TF releases that are available from conda-forge. I quickly ran into problems with all the ones I tested (2.6.2, 2.7.1, 2.8.1, 2.10.0). Here is the output of the same script on 2.6.2 (other versions are similar):

Keras model TF 2.6.2 output
2022-12-12 22:15:32.102254: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:15:32.127075: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:15:32.127450: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:15:32.127922: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-12 22:15:32.128265: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:15:32.128457: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:15:32.128765: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:15:32.649401: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:15:32.649581: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:15:32.649722: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:15:32.649845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1612 MB memory:  -> device: 0, name: NVIDIA GeForce MX150, pci bus id: 0000:01:00.0, compute capability: 6.1
2022-12-12 22:15:32.689420: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 4294967296 exceeds 10% of free system memory.
2022-12-12 22:15:45.639836: W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.00GiB (rounded to 4294967296)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2022-12-12 22:15:45.639923: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] BFCAllocator dump for GPU_0_bfc
2022-12-12 22:15:45.639958: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (256): 	Total Chunks: 8, Chunks in use: 8. 2.0KiB allocated for chunks. 2.0KiB in use in bin. 40B client-requested in use in bin.
2022-12-12 22:15:45.639985: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (512): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640011: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (1024): 	Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2022-12-12 22:15:45.640093: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (2048): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640152: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (4096): 	Total Chunks: 2, Chunks in use: 1. 11.0KiB allocated for chunks. 4.0KiB in use in bin. 4.0KiB client-requested in use in bin.
2022-12-12 22:15:45.640194: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (8192): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640234: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (16384): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640294: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (32768): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640339: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (65536): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640392: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (131072): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640445: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (262144): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640496: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (524288): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640539: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (1048576): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640576: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (2097152): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640614: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (4194304): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640661: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (8388608): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640709: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (16777216): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640766: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (33554432): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640812: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (67108864): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640855: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (134217728): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640918: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (268435456): 	Total Chunks: 1, Chunks in use: 0. 1.57GiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:15:45.640966: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Bin for 4.00GiB was 256.00MiB, Chunk State: 
2022-12-12 22:15:45.641025: I tensorflow/core/common_runtime/bfc_allocator.cc:1033]   Size: 1.57GiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev:   Size: 4.0KiB | Requested Size: 4.0KiB | in_use: 1 | bin_num: -1
2022-12-12 22:15:45.641060: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Next region of size 1690894336
2022-12-12 22:15:45.641092: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7f31ea000000 of size 256 next 1
2022-12-12 22:15:45.641122: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7f31ea000100 of size 1280 next 2
2022-12-12 22:15:45.641158: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7f31ea000600 of size 256 next 3
2022-12-12 22:15:45.641189: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7f31ea000700 of size 256 next 4
2022-12-12 22:15:45.641231: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7f31ea000800 of size 256 next 5
2022-12-12 22:15:45.641264: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7f31ea000900 of size 256 next 6
2022-12-12 22:15:45.641307: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7f31ea000a00 of size 256 next 9
2022-12-12 22:15:45.641347: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7f31ea000b00 of size 256 next 10
2022-12-12 22:15:45.641379: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7f31ea000c00 of size 256 next 11
2022-12-12 22:15:45.641416: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] Free  at 7f31ea000d00 of size 7168 next 7
2022-12-12 22:15:45.641446: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at 7f31ea002900 of size 4096 next 8
2022-12-12 22:15:45.641475: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] Free  at 7f31ea003900 of size 1690879744 next 18446744073709551615
2022-12-12 22:15:45.641502: I tensorflow/core/common_runtime/bfc_allocator.cc:1065]      Summary of in-use Chunks by size: 
2022-12-12 22:15:45.641537: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 8 Chunks of size 256 totalling 2.0KiB
2022-12-12 22:15:45.641568: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 1280 totalling 1.2KiB
2022-12-12 22:15:45.641600: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 4096 totalling 4.0KiB
2022-12-12 22:15:45.641644: I tensorflow/core/common_runtime/bfc_allocator.cc:1072] Sum Total of in-use chunks: 7.2KiB
2022-12-12 22:15:45.641676: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] total_region_allocated_bytes_: 1690894336 memory_limit_: 1690894336 available bytes: 0 curr_region_allocation_bytes_: 3381788672
2022-12-12 22:15:45.641722: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] Stats: 
Limit:                      1690894336
InUse:                            7424
MaxInUse:                        14336
NumAllocs:                          13
MaxAllocSize:                     4096
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2022-12-12 22:15:45.641759: W tensorflow/core/common_runtime/bfc_allocator.cc:468] *___________________________________________________________________________________________________
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 1)                 1025      
=================================================================
Total params: 1,025
Trainable params: 1,025
Non-trainable params: 0
_________________________________________________________________
Traceback (most recent call last):
  File "/home/myuser/proj/ml-wordemb/./test-keras-fit.py", line 18, in <module>
    model.fit(X, Y, batch_size=32)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/keras/engine/training.py", line 1134, in fit
    data_handler = data_adapter.get_data_handler(
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 1383, in get_data_handler
    return DataHandler(*args, **kwargs)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 1138, in __init__
    self._adapter = adapter_cls(
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 230, in __init__
    x, y, sample_weights = _process_tensorlike((x, y, sample_weights))
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 1031, in _process_tensorlike
    inputs = tf.nest.map_structure(_convert_numpy_and_scipy, inputs)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/util/nest.py", line 869, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/util/nest.py", line 869, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/keras/engine/data_adapter.py", line 1026, in _convert_numpy_and_scipy
    return tf.convert_to_tensor(x, dtype=dtype)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1430, in convert_to_tensor_v2_with_dispatch
    return convert_to_tensor_v2(
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1436, in convert_to_tensor_v2
    return convert_to_tensor(
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 271, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 308, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 106, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

The key error message seems to be this line:

2022-12-12 22:15:45.639836: W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.00GiB (rounded to 4294967296)requested by op _EagerConst

Apparently TF 2.6.2 (and later) is trying to copy the whole 4GB array into VRAM, and obviously it won’t fit. Why is the behavior different from TF 2.4.1? Has something changed, or is there a setting I need to change to enable batching instead of copying everything at once?
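
To make the "one batch at a time" expectation concrete, the arrays could be sliced on the host so that only one batch is handed over per step. The following is only a minimal sketch: `iter_batches` is a hypothetical helper, not part of Keras or TensorFlow, and whether feeding `model.fit` from such a generator actually avoids the 4GB allocation on TF 2.6 and later has not been verified here.

```python
def iter_batches(x, y, batch_size):
    """Yield (x_batch, y_batch) pairs, slicing one batch at a time.

    Works with any sliceable sequences of equal length (lists, NumPy arrays).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of rows")
    for start in range(0, len(x), batch_size):
        stop = start + batch_size
        yield x[start:stop], y[start:stop]


# Small demonstration with plain lists instead of the 4GB array.
demo_x = list(range(10))
demo_y = [v * 2 for v in demo_x]
demo_batches = list(iter_batches(demo_x, demo_y, batch_size=4))
print(len(demo_batches))  # 3 batches: sizes 4, 4 and 2
print(demo_batches[-1])   # ([8, 9], [16, 18])
```

With NumPy arrays, basic slices like these are views, so no extra copy of the data is made on the host side.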

I also tried tf.data.Dataset, but this post is already getting too long, so I’ll put it in a follow-up reply.

Thanks in advance,
Osma

Hi, this is a follow-up to the above.

Next, I tried to find out if tf.data.Dataset would help. It was recommended in similar discussions, and its description, “Represents a potentially large set of elements.”, seems appropriate for holding a large NumPy array. It also appears to support batching, which is what I want to happen here. But again, I’m running into the same GPU memory error. This happens already when I’m creating the Dataset object, before I have defined a model, let alone tried to train it. Here is a minimal example which allocates a 4GB array and tries to create a dataset out of it:

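import numpy as np
import tensorflow as tf
from tensorflow.data import Dataset

# 4GB array (1024*1024*1024 float32 values)
array = np.zeros((1024, 1024, 1024), dtype=np.float32)
# convert the NumPy array to a tensor, then slice it along the first axis
tensor = tf.convert_to_tensor(array)
dataset = Dataset.from_tensor_slices(tensor)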

This is the output on TF 2.4.1 where this runs just fine:

tf.data.Dataset output on TF 2.4.1
2022-12-12 22:21:54.947398: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-12-12 22:21:55.784111: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-12-12 22:21:55.784799: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-12-12 22:21:55.819231: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:21:55.819605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce MX150 computeCapability: 6.1
coreClock: 1.5315GHz coreCount: 3 deviceMemorySize: 1.96GiB deviceMemoryBandwidth: 44.76GiB/s
2022-12-12 22:21:55.819631: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-12-12 22:21:55.821446: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2022-12-12 22:21:55.821508: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2022-12-12 22:21:55.822999: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-12-12 22:21:55.823248: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-12-12 22:21:55.824858: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-12-12 22:21:55.825707: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2022-12-12 22:21:55.828932: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2022-12-12 22:21:55.829041: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:21:55.829424: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:21:55.829695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-12-12 22:21:55.829932: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-12 22:21:55.830238: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-12-12 22:21:55.830309: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:21:55.830592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce MX150 computeCapability: 6.1
coreClock: 1.5315GHz coreCount: 3 deviceMemorySize: 1.96GiB deviceMemoryBandwidth: 44.76GiB/s
2022-12-12 22:21:55.830611: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-12-12 22:21:55.830626: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2022-12-12 22:21:55.830635: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2022-12-12 22:21:55.830645: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-12-12 22:21:55.830654: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-12-12 22:21:55.830663: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-12-12 22:21:55.830672: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2022-12-12 22:21:55.830682: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2022-12-12 22:21:55.830725: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:21:55.831028: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:21:55.831311: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-12-12 22:21:55.831337: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-12-12 22:21:56.276283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-12-12 22:21:56.276323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2022-12-12 22:21:56.276329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2022-12-12 22:21:56.276534: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:21:56.276727: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:21:56.276863: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:21:56.276985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1632 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce MX150, pci bus id: 0000:01:00.0, compute capability: 6.1)
2022-12-12 22:21:56.277729: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 4294967296 exceeds 10% of free system memory.

However, when I run this under TF 2.6.2 (or newer), I get this instead:

tf.data.Dataset output on TF 2.6.2
2022-12-12 22:23:55.329616: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:23:55.354837: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:23:55.355228: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:23:55.355720: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-12 22:23:55.356054: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:23:55.356335: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:23:55.356643: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:23:55.877455: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:23:55.877700: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:23:55.877844: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-12 22:23:55.877970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1612 MB memory:  -> device: 0, name: NVIDIA GeForce MX150, pci bus id: 0000:01:00.0, compute capability: 6.1
2022-12-12 22:23:55.878701: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 4294967296 exceeds 10% of free system memory.
2022-12-12 22:24:08.808320: W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.00GiB (rounded to 4294967296)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2022-12-12 22:24:08.808397: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] BFCAllocator dump for GPU_0_bfc
2022-12-12 22:24:08.808426: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (256): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808447: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (512): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808467: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (1024): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808486: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (2048): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808505: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (4096): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808524: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (8192): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808544: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (16384): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808594: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (32768): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808634: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (65536): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808670: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (131072): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808707: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (262144): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808747: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (524288): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808796: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (1048576): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808846: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (2097152): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808892: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (4194304): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808950: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (8388608): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.808998: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (16777216): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.809042: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (33554432): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.809088: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (67108864): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.809124: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (134217728): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.809166: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (268435456): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-12 22:24:08.809224: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Bin for 4.00GiB was 256.00MiB, Chunk State: 
2022-12-12 22:24:08.809271: I tensorflow/core/common_runtime/bfc_allocator.cc:1065]      Summary of in-use Chunks by size: 
2022-12-12 22:24:08.809312: I tensorflow/core/common_runtime/bfc_allocator.cc:1072] Sum Total of in-use chunks: 0B
2022-12-12 22:24:08.809353: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] total_region_allocated_bytes_: 0 memory_limit_: 1690894336 available bytes: 1690894336 curr_region_allocation_bytes_: 1690894336
2022-12-12 22:24:08.809402: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] Stats: 
Limit:                      1690894336
InUse:                               0
MaxInUse:                            0
NumAllocs:                           0
MaxAllocSize:                        0
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2022-12-12 22:24:08.809451: W tensorflow/core/common_runtime/bfc_allocator.cc:468] <allocator contains no memory>
Traceback (most recent call last):
  File "/home/myuser/proj/ml-wordemb/./test-tf-dataset.py", line 8, in <module>
    tensor = tf.convert_to_tensor(array)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1430, in convert_to_tensor_v2_with_dispatch
    return convert_to_tensor_v2(
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1436, in convert_to_tensor_v2
    return convert_to_tensor(
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 271, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 283, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 308, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/myuser/miniconda3/envs/tf-2.6.2/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py", line 106, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

Again the key error message is this:

2022-12-12 22:24:08.808320: W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.00GiB (rounded to 4294967296)requested by op _EagerConst

which seems to indicate that TF tries to copy the whole 4GB array into VRAM, even though I’ve only created a Dataset object. There is no chance to even call .batch() on it as the object creation is enough to trigger the copy which will inevitably fail because I don’t have enough VRAM.
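A quick way to check where eager tensors end up is to inspect the `.device` attribute of a small tensor. This is only a diagnostic sketch: the tiny array stands in for the 4GB one, and the reported device will depend on the TF version and on whether a GPU is visible.

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in for the 4GB array, so the check itself cannot exhaust memory.
small = np.zeros((4, 8), dtype=np.float32)

# Convert with default placement, exactly as in the failing script.
default_tensor = tf.convert_to_tensor(small)

# List the GPUs TensorFlow can see, then the device that holds the tensor.
print("GPUs visible:", tf.config.list_physical_devices('GPU'))
print("default placement:", default_tensor.device)
```

If the second line reports a GPU device, the same conversion applied to the full array would have to fit in VRAM.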

Did something happen with TF 2.5 or 2.6 that makes copying data to GPU memory happen earlier or more eagerly than in TF 2.4?

Thanks again,
Osma

In the traceback you’re not even getting to the Dataset; you’re running out of memory at the convert_to_tensor step. Why are you making such a big tensor in the first place?

Traceback (most recent call last):
  File "/home/myuser/proj/ml-wordemb/./test-tf-dataset.py", line 8, in <module>
    tensor = tf.convert_to_tensor(array)

Also, you can likely avoid making a copy of this giant array by using tf.zeros instead.

W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.00GiB (rounded to 4294967296)requested by op _EagerConst

You can force it to CPU using:

import tensorflow as tf

with tf.device("CPU"):
    zeros = tf.zeros([1024, 1024, 1024])

Generally the point of using tf.data is to iterate over files and avoid creating large tensors in the first place.
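As a rough illustration of that idea, a dataset can be fed from a Python generator so that only one batch at a time is turned into a tensor. This is a sketch under assumptions: the small arrays stand in for the real data (which could equally come from files), and `batch_generator` and `BATCH_SIZE` are made-up names for this example.

```python
import numpy as np
import tensorflow as tf

# Small stand-ins for the large training arrays.
X = np.zeros((1000, 16), dtype=np.float32)
Y = np.ones(1000, dtype=np.float32)

BATCH_SIZE = 32  # illustrative value

def batch_generator():
    # Yield one (features, labels) slice at a time; the full arrays stay in NumPy.
    for start in range(0, len(X), BATCH_SIZE):
        yield X[start:start + BATCH_SIZE], Y[start:start + BATCH_SIZE]

dataset = tf.data.Dataset.from_generator(
    batch_generator,
    output_signature=(
        tf.TensorSpec(shape=(None, 16), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.float32),
    ),
)

first_x, first_y = next(iter(dataset))
num_batches = sum(1 for _ in dataset)
print(first_x.shape, first_y.shape, num_batches)
```

Because the generator already yields whole batches, the resulting dataset can be passed to `model.fit` without a separate `.batch()` call.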

Thanks @markdaoust for your quick reply!

Maybe using Dataset wasn’t a great idea, so you can ignore the second post. My starting point, in the OP, is simply that the training data is already in a big NumPy array and I would like to train a Keras model on that. It used to work fine on TF 2.4.1, but in later versions it doesn’t work anymore because apparently TF tries to copy the whole array into GPU memory instead of batching it. I was trying to work around that with Dataset, but if it’s not suitable for that, I can just drop it.

But the original problem remains - how can I train a Keras model from a large NumPy array without the whole array getting copied to the GPU at once? And why did this stop working when I upgraded to 2.6 or later versions?

Thanks,
Osma

tf.data’s a totally reasonable approach to this.

It just doesn’t help if you load the tensor onto the GPU first.

What changed

IIRC there were a lot of bugs where people were creating tensors and expecting them to be created on GPU but they were created on CPU instead.

I’d be surprised if this didn’t fix your problem, with or without tf.data:

with tf.device('CPU'):
    tensor = tf.convert_to_tensor(array)
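
For the tf.data variant, the same device scope can wrap the dataset creation. The following is a minimal sketch, not a tested fix for the 4GB case: the arrays are deliberately small, the toy model mirrors the one from the original post, and whether this keeps the source tensors in host memory on a given TF version is an assumption worth verifying on the actual setup.

```python
import numpy as np
import tensorflow as tf

# Small stand-ins for the large arrays from the original post.
X = np.zeros((256, 16), dtype=np.float32)
Y = np.ones(256, dtype=np.float32)

# Create the dataset under a CPU device scope so that the conversion of the
# NumPy arrays to tensors is requested on the CPU rather than the GPU.
with tf.device('/CPU:0'):
    dataset = tf.data.Dataset.from_tensor_slices((X, Y))

# Batch and prefetch; each batch is handed to the model as training consumes it.
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)

# Toy linear-regression model, as in the original example.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(1, activation='linear'),
])
model.compile(loss='mean_squared_error')

history = model.fit(dataset, epochs=1, verbose=0)
print(history.history['loss'])
```

Note that `batch_size` is not passed to `model.fit` here, since the dataset is already batched.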

Agree 100% - and it appears TF now does this by default. OK, makes sense.

Ah, now I think I understand how to apply this. I modified the original Keras toy example code to use this pattern and convert the X and Y arrays into tensors on the CPU, like this:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# 4GB array
X = np.zeros((1024*1024, 1024), dtype=np.float32)
# 4MB array
Y = np.ones(1024*1024, dtype=np.float32)

with tf.device('CPU'):
    X_tensor = tf.convert_to_tensor(X)
    Y_tensor = tf.convert_to_tensor(Y)

# define a toy model (linear regression, 1025 parameters)
model = Sequential()
model.add(Dense(1, input_shape=(1024,), activation='linear'))
model.compile(loss='mean_squared_error')
model.summary()

model.fit(X_tensor, Y_tensor, batch_size=32)

Now the model training also works on TF 2.6.2 and 2.10.0! (I didn’t test the versions in between)

Unfortunately, it’s quite a bit slower than it was on 2.4.1 with the original code. That took 32 seconds to train, while this new version takes 57-62 seconds to train on TF 2.10.0. But at least it is using the GPU as it should, and not running out of GPU memory.

Thanks a lot for the helpful guidance!

Osma