Background: We train our model with TF 1.15. Unfortunately, we cannot upgrade to TF 2.8 because we have added custom functionality to our TensorFlow build.
Question: I have a large model and want to speed up training with XLA, but I hit a CUDA_ERROR_ILLEGAL_ADDRESS when XLA is enabled. How can I solve this?
Here is the information I have so far:
- I have the name of the failing XLA cluster.
- I have all inputs of the failing XLA cluster.
- I have dumped all XLA HLO code with:
export XLA_FLAGS="--xla_dump_to=./xla"
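To narrow the dump down to the failing cluster, a small helper along these lines might be useful. It is only a sketch: it assumes the dump file names contain the cluster name (for example `cluster_728`), which is how the dumps appear to be named here but may differ between builds. The function name `find_cluster_dumps` is hypothetical and not part of TensorFlow or XLA.

```python
import os
import re


def find_cluster_dumps(dump_dir, cluster_name):
    """Return sorted paths of dump files whose file name mentions cluster_name.

    Assumption: XLA puts the cluster name into the dump file name.
    The negative lookahead keeps "cluster_728" from matching "cluster_7280".
    """
    pattern = re.compile(re.escape(cluster_name) + r"(?!\d)")
    matches = []
    for root, _dirs, files in os.walk(dump_dir):
        for name in files:
            if pattern.search(name):
                matches.append(os.path.join(root, name))
    return sorted(matches)


if __name__ == "__main__":
    # "./xla" is the directory passed to --xla_dump_to above.
    for path in find_cluster_dumps("./xla", "cluster_728"):
        print(path)
```

Running it after a failing training step should list only the files that belong to the suspect cluster, which makes it easier to attach just those files to a report.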
To make debugging easier, I am looking for a way to convert the dumped IR into a standalone executable.
Debugging XLA is very difficult, so any help is appreciated.
Other information:
cluster_728 has 30 ops.
The problem does not reproduce when I narrow down the cluster.
The problem also does not reproduce when I reduce the batch_size.
Debug file for cluster_728: module_0023.tar.gz