I’m trying to do sensitivity analysis (forward-mode autodiff) on a matrix and was hoping to parallelise the computations using TensorFlow. Here’s the code that I’m using to test whether something like this is possible in TF:
In the code above, the evaluations of Z1 and dZ1 are independent of each other (likewise for A1 and dA1, etc.), so I was hoping to run each pair of statements in parallel. I wrapped this function in tf.function and was hoping for a speedup compared to the standard way of computing gradients (forward pass plus backprop), since half the calculations would now run in parallel. However, I don’t see any speedup; both versions take the same time to execute.
I don’t know if it’s possible to do what I’m trying here. Any help would be appreciated.
Yes, I think that is a good point: if all the computations are assigned to the GPU and all are executed on a single stream, then you won’t get a speedup from additional parallelism. What I’d do here is normally a bit low level (that’s the area I normally work in): dump the graph to see if there are any unexpected edges constraining the execution order (these days I’d dump the GraphDef, convert it to TFG with the tfg-translate tool, and read the output directly, as it is quite readable; before TFG I’d pipe it through to a Graphviz file), and then run with vmodule=executor=1 to see exactly what’s being run and where (this can produce a lot of output even for small graphs).
I don’t know who from the TFRT team is on here to ping for comment; let me check.
TF still has a single compute stream per session (transfers and NCCL ops each use their own stream). This is not trivial to change, because the GPU memory allocator implicitly synchronizes all allocations on that compute stream.
We are implementing multi-stream support in the HLO to TFRT compiler, but that’s not ready yet.
It is a little bit hard today, with all the compiler and runtime work in progress, to really understand what kind of code is actually going to be produced and scheduled on the hardware.
I think the gap is still too large between the people working every day on compilers and runtimes and the people who are just trying to reason about performance from their high-level, compositional API path.
I hope we can improve the current situation and reduce this gap by having more usability-oriented documentation and tools at the high-level (user) end of the spectrum. Otherwise it will be really hard to meet on common ground when discussing performance.
Yes and no: it can be, but that visualization is aimed more at ML practitioners, and it can elide control edges that obscure the model structure but are important for folks debugging scheduling questions.
You are correct. There is an ongoing workstream around performance predictability (well, predictability in general) here. But adjacent to that is what you mention: the communication gap that exists today at the technical level.