Hi,
I’m using tf-nightly-gpu and tb-nightly to debug a numerically unstable model, and I’d like to see where NaNs are occurring in the compute graph. So I’ve enabled debug dumping:
tf.debugging.experimental.enable_dump_debug_info("debug", tensor_debug_mode="FULL_HEALTH", circular_buffer_size=-1)
I run one epoch of training and then launch TensorBoard:
tensorboard --logdir=debug
I enable the Debugger V2 plugin but am always greeted with:
Debugger V2 is inactive because no data is available.
Here are the files written to the debug/ directory:
root@ce9fb22e47b0:/projects/FasterRCNN/tf2# ls debug -alh
total 195M
drwxr-xr-x 2 root root 4.0K Jan 4 18:37 .
drwxrwxr-x 7 1000 1000 4.0K Jan 4 18:40 ..
-rw-r--r-- 1 root root 5.6M Jan 4 18:37 tfdbg_events.1641320809.ce9fb22e47b0.execution
-rw-r--r-- 1 root root 109M Jan 4 18:37 tfdbg_events.1641320809.ce9fb22e47b0.graph_execution_traces
-rw-r--r-- 1 root root 18M Jan 4 18:37 tfdbg_events.1641320809.ce9fb22e47b0.graphs
-rw-r--r-- 1 root root 71 Jan 4 18:26 tfdbg_events.1641320809.ce9fb22e47b0.metadata
-rw-r--r-- 1 root root 7.1M Jan 4 18:37 tfdbg_events.1641320809.ce9fb22e47b0.source_files
-rw-r--r-- 1 root root 84K Jan 4 18:37 tfdbg_events.1641320809.ce9fb22e47b0.stack_frames
-rw-r--r-- 1 root root 3.1M Jan 4 18:40 tfdbg_events.1641321476.ce9fb22e47b0.execution
-rw-r--r-- 1 root root 18M Jan 4 18:40 tfdbg_events.1641321476.ce9fb22e47b0.graph_execution_traces
-rw-r--r-- 1 root root 29M Jan 4 18:40 tfdbg_events.1641321476.ce9fb22e47b0.graphs
-rw-r--r-- 1 root root 71 Jan 4 18:37 tfdbg_events.1641321476.ce9fb22e47b0.metadata
-rw-r--r-- 1 root root 7.1M Jan 4 18:40 tfdbg_events.1641321476.ce9fb22e47b0.source_files
-rw-r--r-- 1 root root 85K Jan 4 18:40 tfdbg_events.1641321476.ce9fb22e47b0.stack_frames
There is no indication of any other error from either TensorBoard or TensorFlow during training.
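In case it helps narrow things down, here is a minimal sketch (plain Python, standard library only) of how the listing above can be grouped into file sets. It assumes file names follow the pattern tfdbg_events.<timestamp>.<hostname>.<suffix> seen in my listing. The helper names group_dump_files and missing_suffixes are my own, and the "expected" suffix set is just the six suffixes that appear in the listing, not something taken from a TensorFlow API.

```python
import os
from collections import defaultdict

# Assumption: a complete file set has the six suffixes seen in the listing above.
EXPECTED_SUFFIXES = {
    "execution",
    "graph_execution_traces",
    "graphs",
    "metadata",
    "source_files",
    "stack_frames",
}


def group_dump_files(filenames):
    """Map each 'tfdbg_events.<timestamp>.<hostname>' prefix to its suffixes."""
    groups = defaultdict(set)
    for name in filenames:
        if not name.startswith("tfdbg_events."):
            continue  # ignore anything that is not a debug dump file
        prefix, _, suffix = name.rpartition(".")
        groups[prefix].add(suffix)
    return dict(groups)


def missing_suffixes(groups):
    """For each prefix, list the expected suffixes that are absent."""
    return {
        prefix: sorted(EXPECTED_SUFFIXES - suffixes)
        for prefix, suffixes in groups.items()
    }


if __name__ == "__main__":
    dump_root = "debug"  # same directory passed to enable_dump_debug_info
    if os.path.isdir(dump_root):
        groups = group_dump_files(os.listdir(dump_root))
        for prefix, missing in sorted(missing_suffixes(groups).items()):
            print(prefix, "missing:", missing or "nothing")
```

Applied to the listing above, this finds two file sets (timestamps 1641320809 and 1641321476) in the same directory, each with all six suffixes present. I don't know whether having two sets in one logdir matters to the Debugger V2 plugin.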
Thank you,
Bart