Hi folks,
I am pleased to share my latest blog post with you: Distributed Training in TensorFlow with AI Platform & Docker.
https://sayak.dev/distributed-training/
It will walk you through the steps of running distributed training in TensorFlow with AI Platform training
jobs and Docker. Below, I explain the motivation behind this blog post:
If you are conducting large-scale training, you are likely using a powerful remote machine over SSH. So, even if you are not using Jupyter Notebooks, problems like SSH pipe breakage and network teardown can easily occur. Consider the case where your remote is a powerful virtual machine in the cloud. The problem gets far worse when the connection is lost and you forget to turn off that virtual machine, because it keeps consuming resources. From the moment of the breakdown you are billed for practically nothing, unless you have set up alerts and fault tolerance.
To resolve these kinds of problems, we would want the following in the pipeline:
- A training workflow that is fully managed by a secure and reliable service with high availability.
- The service should automatically provision and de-provision the resources we ask it to configure, so that we are charged only for what has actually been consumed.
- The service should also be very flexible. It must not introduce too much technical debt into our existing pipelines.
Happy to address any feedback.