Hello! This is my first time actually posting on these forums, I believe. I usually follow the general advice to Google for answers first, but at this point I think this issue is specific enough to my setup to warrant its own post. I'll show the full warnings first, then my observations and hypothesis about what's going on, then the details of my hardware and OS.
I’m getting the following warnings:
2021-07-28 15:45:36.838849: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-07-28 15:45:36.839143: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-28 15:45:36.840230: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-28 15:45:36.850568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-07-28 15:45:36.852247: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-28 15:45:36.852921: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-28 15:45:36.853943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 15 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 238.66GiB/s
2021-07-28 15:45:36.854084: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-28 15:45:36.855763: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-28 15:45:36.856682: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-07-28 15:45:36.899944: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-28 15:45:41.158493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-28 15:45:41.158534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-07-28 15:45:41.158545: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-07-28 15:45:41.158768: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-28 15:45:41.159384: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-28 15:45:41.159956: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-28 15:45:41.160578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6702 MB memory) → physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
2021-07-28 15:45:41.205455: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 376320000 exceeds 10% of free system memory.
2021-07-28 15:45:41.475303: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 376320000 exceeds 10% of free system memory
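As an aside, I think the repeated NUMA lines are just informational (they're logged at the `I` level, and the GTX 1070 still gets picked up as `/device:GPU:0`). The quick check below is what I'm using to see what the kernel actually reports for the card; the PCI address 0000:01:00.0 comes from the log above.

```python
# Quick check of what sysfs reports for the GPU's NUMA node -- this is the value the
# "successful NUMA node read from SysFS had negative value (-1)" lines are about.
# The PCI address 0000:01:00.0 is taken from the log above.
from pathlib import Path

numa = Path("/sys/bus/pci/devices/0000:01:00.0/numa_node").read_text().strip()
print(f"numa_node for the GTX 1070: {numa}")  # I expect -1 on a single-socket desktop
```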
Observations and Hypothesis
When I first start the training loop, I'm pretty sure it begins fine: the model compiles, training runs, and so on. Since I have a mechanical hard drive, I can hear the head moving inside. I started it at around 4:30 PM yesterday and left it running overnight. When I checked just before going to sleep, the desktop GUI (GNOME, I believe) had crashed, but the hard drive was still making noise.
At 7 AM, when I woke up, I could no longer hear the hard drive going. However, Jupyter Notebook still showed an asterisk next to the cell containing the training loop, which is supposed to indicate that it was still running.
All of this makes me think the process was overcommitting memory (which I believe is the default kernel behavior on Linux), and that for some reason the OOM killer chose to kill GNOME rather than the TensorFlow process. But given that I couldn't hear the hard drive working anymore, I don't think the training loop was still running, which would mean TensorFlow eventually stalled or died as well.
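To test the OOM-killer half of that hypothesis, I'm planning to search the kernel log for OOM activity with something like the sketch below. This assumes my user can read the systemd journal (otherwise it would need sudo), and the search terms are just my guess at how the kernel phrases these messages.

```python
# Sketch: search the kernel log since yesterday for OOM-killer activity.
# Assumes systemd's journalctl is available and readable by my user (may need sudo
# or membership in the systemd-journal group).
import subprocess

kernel_log = subprocess.run(
    ["journalctl", "-k", "--since", "yesterday", "--no-pager"],
    capture_output=True, text=True, check=True,
).stdout

hits = [line for line in kernel_log.splitlines()
        if "oom" in line.lower() or "out of memory" in line.lower()]
print("\n".join(hits) if hits else "No OOM-related kernel messages found.")
```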
The "Allocation of 376320000 exceeds 10% of free system memory" warnings in particular are what lead me to think memory pressure could be a factor here. I'm thinking about getting more RAM, but unfortunately my current motherboard only supports DDR3.
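Before spending money on more DDR3, I'm also wondering whether software-side changes would help, along the lines of the toy sketch below: letting TF grow GPU memory on demand and feeding training in smaller batches through tf.data. The data and model here are dummies rather than my real code, and I'm honestly not sure how much this addresses the host-RAM warning as opposed to GPU memory.

```python
# Toy, self-contained sketch of the kind of change I'm considering (dummy data/model,
# not my real training code): grow GPU memory on demand instead of reserving it all,
# and feed training in modest batches through tf.data.
import numpy as np
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)  # don't grab the whole 8 GiB up front

x = np.random.rand(1000, 28, 28).astype("float32")   # stand-in for my actual training data
y = np.random.randint(0, 10, size=(1000,)).astype("int32")

ds = (tf.data.Dataset.from_tensor_slices((x, y))
      .shuffle(1000)
      .batch(32)                          # smaller batches -> smaller per-step allocations
      .prefetch(tf.data.AUTOTUNE))

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(ds, epochs=1)
```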
OS and specs
OS: Ubuntu 20.04.2 LTS
Kernel: 5.4.0-80-generic
CPU: AMD A10-6800K APU (4) @ 4.1GHz
GPU: GeForce GTX 1070
RAM: 8 GiB DDR3
Thank you for reading this far! Hope I can get some advice.