We are observing unexplained out-of-memory (OOM) events on the GPU when training a complex, large model (involving conditional execution) with XLA enabled (jit_compile=True on tf.function).
Unfortunately, we have not yet been able to reproduce the issue in a reduced, shareable form, so I am writing here mostly for feedback.
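For context, here is roughly how we enable XLA, with a toy model standing in for our real one (all names and shapes below are illustrative, not our actual setup):

```python
import tensorflow as tf

# Toy stand-in: a small dense model and a standard custom training step.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.build(input_shape=(None, 512))  # build eagerly so no variables are created inside the compiled step
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function(jit_compile=True)  # the flag in question; removing it falls back to the regular graph path
def train_step(x, y):
    # Our real model also has data-dependent branches, roughly of the form
    #   out = tf.cond(pred, lambda: branch_a(x), lambda: branch_b(x))
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss = loss_fn(y, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = tf.random.normal([256, 512])
y = tf.random.uniform([256], maxval=10, dtype=tf.int32)
train_step(x, y)  # first call traces and XLA-compiles the step
```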
What we see:
In general, GPU memory utilization for an XLA-compiled model goes down considerably compared to non-compiled graph mode or eager execution. This is what we measure in most of our models and in all of our small test cases.
However, in some instances, large models exceed GPU memory capacity when compiled, even though they still run in eager mode with the exact same batch size.
The two behaviors seem contradictory, so we are wondering whether there are known corner cases involving XLA that could produce this and that we could avoid (we really need the extra training efficiency that XLA provides).
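For reference, this is roughly how we compare peak memory between the plain-graph and XLA-compiled versions of a step (a minimal sketch: peak_gpu_mib and dense_step are made-up names, and it assumes TF 2.5+ with a visible GPU):

```python
import tensorflow as tf

def peak_gpu_mib(step_fn, *args, device="GPU:0"):
    """Run step_fn once and report the peak GPU memory (MiB) seen by TF's allocator."""
    tf.config.experimental.reset_memory_stats(device)
    step_fn(*args)  # the first call also triggers tracing/compilation
    return tf.config.experimental.get_memory_info(device)["peak"] / 2**20

def dense_step(x, w):
    return tf.nn.relu(tf.matmul(x, w))

x = tf.random.normal([4096, 4096])
w = tf.random.normal([4096, 4096])

graph_step = tf.function(dense_step)                  # graph mode, no XLA
xla_step = tf.function(dense_step, jit_compile=True)  # XLA-compiled

print("graph peak MiB:", peak_gpu_mib(graph_step, x, w))
print("XLA   peak MiB:", peak_gpu_mib(xla_step, x, w))
```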
Thanks Gus! The video was very informative; I had not seen that one. However, it did not help explain the memory-usage increase we are observing. Looking forward to hearing about others' experiences. Thanks!
In this case performance does not seem to be the issue; rather, it is the increased GPU memory usage. Is there a list of known problems we may be hitting, or perhaps best practices in terms of operators? I would love to file a bug/issue on this, but as I mentioned in the original post, we have so far been unable to isolate the behavior in a shareable form.
Is there a resource (website, doc, etc.) showing the steps to visualize the HLO graph? I am unable to find anything comprehensive except for a couple of posts about Graphviz issues.
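In case it helps, one way to get at the HLO is to ask XLA to dump it via XLA_FLAGS and then render the resulting .dot files with Graphviz. A minimal sketch (the dump directory and function names are just examples):

```python
import os

# Set XLA_FLAGS before TensorFlow initializes XLA (safest: before importing tensorflow).
os.environ["XLA_FLAGS"] = (
    "--xla_dump_to=/tmp/xla_dump "
    "--xla_dump_hlo_as_text "   # writes HLO modules as text
    "--xla_dump_hlo_as_dot"     # writes .dot files, renderable with Graphviz
)

import tensorflow as tf

@tf.function(jit_compile=True)
def f(x):
    return tf.reduce_sum(tf.square(x))

f(tf.random.normal([1024]))  # triggers compilation and writes the dump files

# Then render one of the dumped graphs with Graphviz, e.g.:
#   dot -Tsvg /tmp/xla_dump/<module_name>.dot -o hlo.svg
```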