We have a deep reinforcement learning training routine with a parallel evaluation process. The training process and the evaluation process each have their own model; both are instances of the same class, which inherits from keras.Model. We have run into a strange issue:
With some model classes, the code works fine. With another model class, however, the model instance in the subprocess does not work, while the instance in the main process works well. Specifically, in the subprocess, model(input) just hangs: it never returns and raises no error.
I don’t know whether this is a multiprocessing issue, a model definition issue, or something else. If it is a multiprocessing issue, I can’t explain why it doesn’t happen with the other model classes. If it is a model definition issue, I don’t know why the instance in the main process works without any problem.
Does anyone have any ideas about this issue, or about how to debug it? Many thanks!
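One generic way to find out where a silent hang sits is Python's standard faulthandler module, which can dump every thread's stack if a call has not returned after a timeout. Below is a minimal sketch, not taken from our actual code: call_with_hang_dump and fake_stuck_call are hypothetical names, and fake_stuck_call only simulates a slow call so the sketch runs without TensorFlow. In the evaluation subprocess, the idea would be to wrap the stuck call as call_with_hang_dump(model, input).

```python
import faulthandler
import time


def call_with_hang_dump(fn, *args, timeout=10.0, log_path="hang_traceback.log"):
    """Run fn(*args); if it is still running after `timeout` seconds,
    write the Python stack of every thread to `log_path`."""
    with open(log_path, "w") as log:
        # Arm a watchdog timer; exit=False keeps the process alive after the dump.
        faulthandler.dump_traceback_later(timeout, exit=False, file=log)
        try:
            return fn(*args)
        finally:
            # Disarm the timer once the call returns or raises.
            faulthandler.cancel_dump_traceback_later()


def fake_stuck_call(x, delay=0.3):
    """Hypothetical stand-in for a model(input) call that stalls for `delay` seconds."""
    time.sleep(delay)
    return x
```

If the call never returns, the log file can be read from another terminal while the subprocess is still hanging. The dump only shows Python-level frames, so it narrows down which line is blocked rather than explaining why.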
Thank you for your comment. The symptom is not the same, but I feel the root cause might be related to my issue.
In my case, the model is initialized in the main process and then passed to the subprocess. So the model is initialized successfully, but invoking it through call does not complete.
I still don’t have a clear idea of how to solve this. But thank you again for pointing me in this direction.
Sorry for the long-delayed reply… I got pulled away to work on other projects.
The good news is that the direction you pointed out is correct. It is indeed a case of misusing Keras with multiprocessing, and I’ve already solved it using the material in the issues you posted! Thank you so much for the help!
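For anyone who lands here later: the exact fix is not recorded in this thread, so the following is only a sketch of one common pattern, assuming the hang comes from handing the subprocess a model that was built in the main process. The idea is to pass only a model factory and plain weights to the child, and to build the model inside the child. All names here (TinyModel, evaluate, eval_worker, evaluate_in_subprocess) are hypothetical, and TinyModel is a stdlib-only stand-in for the keras.Model subclass so that the sketch runs without TensorFlow.

```python
import multiprocessing as mp


class TinyModel:
    """Hypothetical stand-in for a keras.Model subclass (no TensorFlow needed)."""

    def __init__(self):
        self.weights = [0.0]

    def get_weights(self):
        return list(self.weights)

    def set_weights(self, weights):
        self.weights = list(weights)

    def __call__(self, x):
        return self.weights[0] * x


def evaluate(build_model, weights, inputs):
    # Build a fresh model here instead of receiving a model object from the parent.
    model = build_model()
    model.set_weights(weights)
    return [model(x) for x in inputs]


def eval_worker(build_model, weights, inputs, out_queue):
    # Runs in the child process and sends plain Python results back.
    out_queue.put(evaluate(build_model, weights, inputs))


def evaluate_in_subprocess(build_model, weights, inputs, timeout=60.0):
    # "spawn" starts a clean interpreter, so no parent framework state is inherited.
    ctx = mp.get_context("spawn")
    out_queue = ctx.Queue()
    proc = ctx.Process(target=eval_worker,
                       args=(build_model, weights, inputs, out_queue))
    proc.start()
    result = out_queue.get(timeout=timeout)  # raises queue.Empty if the child stalls
    proc.join()
    return result
```

With the spawn start method, the call has to be made from a script under an `if __name__ == "__main__":` guard, for example evaluate_in_subprocess(TinyModel, train_model.get_weights(), [1.0, 3.0]), where train_model is the model kept by the training process. With a real Keras model, the weights returned by get_weights() are NumPy arrays, which can be pickled and sent to the child in the same way.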