Hi, I’m training a model with model.fitDataset. The input dimensions are [480, 640, 3] with just 4 outputs of size [1, 4] and a batch size of 3.
Before the first onBatchEnd is called, I’m getting a “High memory usage in GPU, most likely due to a memory leak” warning. However, numTensors after every yield of the generator function is only ~38, and it is the same after each onBatchEnd, so I don’t think I have a leak from undisposed tensors.
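For reference, this is roughly the shape of my training loop (a trimmed sketch; `makeDataset` and `buildModel` are placeholders for my actual pipeline and model):

```ts
import * as tf from '@tensorflow/tfjs';

// Placeholders for my actual data pipeline and model.
declare function makeDataset(): tf.data.Dataset<{ xs: tf.Tensor; ys: tf.Tensor }>;
declare function buildModel(): tf.LayersModel;

async function train() {
  const model = buildModel();
  await model.fitDataset(makeDataset().batch(3), {
    epochs: 10,
    callbacks: {
      onBatchEnd: async (batch) => {
        // Stays at ~38 tensors after every batch, so no obvious undisposed tensors.
        console.log(`batch ${batch}: numTensors = ${tf.memory().numTensors}`);
      },
    },
  });
}
```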
While debugging the TF.js internals I noticed that in acquireTexture, numBytesInGPU goes above 2.2 GB, which is what triggers the warning.
Is this normal behavior for that image size? It means I cannot increase my batchSize because I run out of memory with anything greater than 3.
Is there anything I can do to reduce the GPU memory usage?
Thanks @lgusm. This relates to another question I posted a few weeks ago (https://tensorflow-prod.ospodiscourse.com/t/browser-crashes-with-context-lost-webgl/3904/4), I managed to track down the root of that problem and it seems to be that I’m just running out of GPU memory during training. I track and manually dispose every single tensor I use and can confirm they never grow above ~38 (when the batch size is 3), so I’m not sure where the memory consumption is coming from. I also tried setting WEBGL_DELETE_TEXTURE_THRESHOLD to 0 to no avail.
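For completeness, this is how I set that flag (a sketch; as I understand it, 0 tells the WebGL backend to delete freed textures immediately instead of keeping them in its texture pool):

```ts
import * as tf from '@tensorflow/tfjs';

// 0 = delete freed WebGL textures right away rather than recycling them.
tf.env().set('WEBGL_DELETE_TEXTURE_THRESHOLD', 0);

// numBytesInGPU still climbs past ~2.2 GB during training.
console.log(tf.memory());
```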
It would be great if @Jason or someone else could shed some light on this.
Hello there! Sorry for the delay, I was on holiday at the start of the week.
Do you have a working demo of this on Codepen.io or Glitch.com that our team can use to replicate the issue? You sent one before, but I’m not sure if it’s the same as this one.
I also realized no one has updated the last thread either. I shall ping folks again to see if we managed to find someone with an M1 chip.
PS 640x480 is quite large, which is probably why you are running into memory issues here. I think our models like MobileNet expect 224x224 pixel images as input, for example. Doing some quick math, assuming 1 byte per pixel per colour channel (8 bits):
224x224x3 = 150,528 bytes ≈ 0.14 MB per image, just for the input tensor
640x480x3 = 921,600 bytes ≈ 0.88 MB per image, just for the input tensor
So depending on your model architecture and how large the other tensors in your network are, you could start eating up megabytes pretty fast, versus tenths of a megabyte. My guess is that this is what is causing the large amount of memory used, but it is only a guess as I do not know what model you are using etc.
Also, given your batch size of 3, that is roughly 2.6 MB allocated for the input tensors alone; every other tensor in your model architecture gets memory allocated as well, so you can see how it grows pretty fast.
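If it helps, here is that back-of-the-envelope estimate as a snippet (a rough sketch; note that if your tensors are stored as float32 the cost per channel is 4 bytes rather than 1, so multiply these numbers accordingly):

```ts
// Estimate input-tensor size, assuming 1 byte per pixel per colour channel.
// Pass bytesPerChannel = 4 if the data is stored as float32.
function inputBytes(
  width: number,
  height: number,
  channels = 3,
  batchSize = 1,
  bytesPerChannel = 1
): number {
  return width * height * channels * batchSize * bytesPerChannel;
}

const MB = 1024 * 1024;
console.log((inputBytes(224, 224) / MB).toFixed(2));       // ~0.14 MB per image
console.log((inputBytes(640, 480) / MB).toFixed(2));       // ~0.88 MB per image
console.log((inputBytes(640, 480, 3, 3) / MB).toFixed(2)); // ~2.64 MB for a batch of 3
```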
You’re right, reducing the image size fixed the problem.
I was wondering if there is a way to notify the application when we are running out of memory? As I mentioned, checking tf.memory() at onBatchEnd doesn’t seem to catch this state, since the app crashes between these callbacks. The warning we currently have is great for development purposes, but it would be better to be able to detect this state and let the application decide whether training should be aborted, rather than letting the GPU crash.
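For now the closest workaround I have found is to poll numBytesInGPU myself in the same onBatchEnd callback and bail out before things get critical, something like this (a sketch; the 2 GB budget is just my guess at a safe limit for my machine, numBytesInGPU is only reported by the WebGL backend, and I’m assuming stopTraining also applies to fitDataset):

```ts
import * as tf from '@tensorflow/tfjs';

const GPU_BYTE_BUDGET = 2 * 1024 * 1024 * 1024; // hypothetical safety limit

async function trainWithBudget(
  model: tf.LayersModel,
  dataset: tf.data.Dataset<{ xs: tf.Tensor; ys: tf.Tensor }>
) {
  await model.fitDataset(dataset, {
    epochs: 10,
    callbacks: {
      onBatchEnd: async () => {
        const mem = tf.memory() as { numTensors: number; numBytesInGPU?: number };
        if (mem.numBytesInGPU !== undefined && mem.numBytesInGPU > GPU_BYTE_BUDGET) {
          console.warn('GPU memory budget exceeded, aborting training');
          model.stopTraining = true; // stops after the current batch
        }
      },
    },
  });
}
```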
I’ve run into similar issues. Yes, making the image smaller helps. OTOH, if you have already properly accounted for any leaked tensors by checking tf.memory() after each frame, then the problem is more likely fragmentation in the TF memory allocator, or internal TF leaks.
@Jason FWIW, 640x480 is not that big, depending on your GPU. On an A100, for instance, you can easily fit like 20 of those into a single batch. If you are using a GPU back-end running with CUDA, I believe the largest texture dim is 8K. TF should have no trouble handling larger images if your GPU has enough memory. Sure, for classification, they always use small ~300x300 images, but for running kernels that generate new images, folks will definitely need to be able to run much larger frames, ideally 4K and 8K.
One approach to handling larger frame sizes is to tile the input. Of course this only works for kernels/models that point-sample the original image, rather than ones where the kernel reads a larger region of the source, like a convolution. For fixed-width kernels like convolution, you’ll need to pad the tiles by the kernel width and then throw away the extra bits when you recombine the tiles, as in the sketch below.
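Here is a rough illustration of the padded-tile idea (just a sketch under simplifying assumptions: a [height, width, channels] image whose sides divide evenly by tileSize, and a `fn` that preserves the tile’s spatial size):

```ts
import * as tf from '@tensorflow/tfjs';

// Split an image into overlapping tiles, run `fn` on each tile, crop the halo,
// and stitch the results back together.
function runTiled(
  image: tf.Tensor3D,
  fn: (tile: tf.Tensor3D) => tf.Tensor3D,
  tileSize: number,
  halo: number // roughly half the kernel width
): tf.Tensor3D {
  return tf.tidy(() => {
    const [h, w, c] = image.shape;
    // Pad the whole image so edge tiles also get a full halo.
    const padded = tf.mirrorPad(image, [[halo, halo], [halo, halo], [0, 0]], 'reflect');
    const rows: tf.Tensor3D[] = [];
    for (let y = 0; y < h; y += tileSize) {
      const cols: tf.Tensor3D[] = [];
      for (let x = 0; x < w; x += tileSize) {
        // Tile plus halo on every side.
        const tile = padded.slice([y, x, 0], [tileSize + 2 * halo, tileSize + 2 * halo, c]);
        const out = fn(tile);
        // Throw away the halo before stitching.
        cols.push(out.slice([halo, halo, 0], [tileSize, tileSize, -1]));
      }
      rows.push(tf.concat(cols, 1));
    }
    return tf.concat(rows, 0);
  });
}
```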
This is on the front-end client side, however, where you do not know what GPU you will be running on. Also, we get our GPU acceleration via WebGL - there is no CUDA on the front end either. I agree that if you are using Node.js on the backend, the rules are the same as for Python, since you know the exact environment you will run in ahead of time. But on the front end one must account for GPU variability - you may be running on a smartphone, an ultrabook, or a desktop with the latest GPU. I am unsure what the original poster was running on, but I am pretty sure it was not an A100. As web engineers we must design for every system, as the system is not fixed, to ensure most users can run it.
Good point, and he’s also training rather than predicting. My front-end TFJS code uses the detect-gpu NPM library to determine the GPU type and available GPU memory, and sets the batch size accordingly.
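Roughly like this (a sketch; the tier-to-batch-size mapping is my own heuristic, not anything detect-gpu provides):

```ts
import { getGPUTier } from 'detect-gpu';

async function pickBatchSize(): Promise<number> {
  // detect-gpu benchmarks the WebGL renderer string against a lookup table
  // and returns a rough performance tier (0-3) plus the GPU name.
  const gpuTier = await getGPUTier();

  // My own heuristic; tune for your model and input size.
  const batchSizeByTier: Record<number, number> = { 0: 1, 1: 2, 2: 4, 3: 8 };
  const batchSize = batchSizeByTier[gpuTier.tier] ?? 1;

  console.log(`GPU: ${gpuTier.gpu ?? 'unknown'}, tier ${gpuTier.tier}, batch size ${batchSize}`);
  return batchSize;
}
```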
That said, I have always been frustrated that the folks who train the big base models use such small training images. Is there research showing this is sufficient and that no benefit comes from increasing the training image size? I can see how performance was a bottleneck years back, but it shouldn’t be now. Do you know of any major models that have been trained on larger source images?
As our users are global, with varying devices, not everyone will have the latest hardware. Again, I agree that if you are lucky enough to have current-gen hardware, these things are less of an issue, but the majority of the world has something that is maybe 5 years old or so. I myself still use a Pixel 2XL as my personal phone, and only this year upgraded my desktop (7 years old!), yet I am still running a GPU that is a few years old due to the current GPU shortages.
Looking at recent research papers from 2021, though, folks are still resizing to lower resolutions, e.g.:
Totally agree on ‘design for every system’… For context, we are currently training in the browser, and at the time of posting the original question I was testing on a Mac mini with an M1 chip and on a Windows machine with a Radeon RX 570. We expect our models to be usable on low-end computers with integrated GPUs.