This project is a collaborative effort between @deep-diver and me. Below, we provide a detailed overview of our submission.
Making Stable Diffusion Your Own
The original Stable Diffusion model can generate high-quality images. However, it remains hard to obtain images in a specific style through prompt engineering alone. If we want specificity in the generated images, fine-tuning Stable Diffusion can be quite effective. Research shows that for this to work well, only the diffusion model needs to be fine-tuned, while the text encoder and image decoder can be kept frozen. This brings us to a question: what if we could replace just the diffusion model of a deployed system while keeping the other parts, the text encoder and image decoder, as they are?
To this end, this project covers two subjects: (1) how to fine-tune Stable Diffusion from KerasCV, and (2) how to deploy Stable Diffusion in various ways.
We believe that by covering these two subjects together, ML practitioners and software engineers will have good coverage of the tools needed to put Stable Diffusion to work in different application use cases.
Fine-tuning
If you’re looking for more specificity in terms of style, texture, and text-image alignment, fine-tuning can bring clear benefits. To demonstrate this, we fine-tuned the diffusion model on a custom dataset while keeping the text encoder and image decoder frozen.
We borrowed ideas from this tutorial by Hugging Face and faithfully reimplemented the code in TensorFlow. Our reimplementation is customizable and supports mixed-precision training along with model checkpointing.
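As a rough sketch of the freezing scheme, here is how the three parts can be configured with KerasCV's StableDiffusion (a minimal illustration only; the full, configurable training loop lives in the stable-diffusion-keras-ft repository):

```python
import keras_cv
from tensorflow import keras

# Enable mixed precision before building the model (optional).
keras.mixed_precision.set_global_policy("mixed_float16")

model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

model.text_encoder.trainable = False    # text encoder stays frozen
model.decoder.trainable = False         # image decoder stays frozen
model.diffusion_model.trainable = True  # only this part gets fine-tuned
```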
We believe this will enable practitioners to repurpose Stable Diffusion for their own applications even more effectively.
Check out the stable-diffusion-keras-ft repository for more details.
Deployment
Stable Diffusion can be deployed in various ways since it primarily consists of three models (text encoder, diffusion model, and decoder) plus some inference-time code. In this project, we cover different deployment scenarios across different platforms and frameworks, including Google Kubernetes Engine and Hugging Face Inference Endpoints, with FastAPI, TensorFlow Serving, and a Hugging Face custom handler. For more information, check out the keras-sd-serving repository.
Different applications come with varying needs in terms of serving infrastructure, compute budget, cost, and so on. This is why we believe that by decoupling the pipeline into separately deployable parts, we can devise better deployment strategies for each scenario.
1. All in one endpoint
Simply deploy Stable Diffusion as a whole to a single endpoint. In this case, all the pieces of code are packed into a single deployment. The problem with this approach is that resources are not utilized optimally, since Stable Diffusion runs three different models internally: the text encoder runs fine on CPUs, the decoder needs only a small GPU, and the diffusion model requires a much larger one.
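Here is a minimal sketch of the all-in-one scenario with FastAPI (the endpoint path and payload shape are illustrative, not the exact ones used in keras-sd-serving):

```python
import keras_cv
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# The whole pipeline lives in one process behind one endpoint.
model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

class GenerateRequest(BaseModel):
    prompt: str
    batch_size: int = 1

@app.post("/generate")
def generate(req: GenerateRequest):
    # Runs text encoder, diffusion model, and decoder back to back.
    images = model.text_to_image(req.prompt, batch_size=req.batch_size)
    return {"images": images.tolist()}
```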
2. Three endpoints
To overcome this problem, you can split Stable Diffusion into three endpoints and have the client program interact with them sequentially. The trade-off is the added latency from the communication, encoding, decoding, and parsing steps between the client and the servers.
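For illustration, a client chaining the three endpoints might look like the following sketch (the URLs and payload fields are hypothetical, not the actual keras-sd-serving interfaces):

```python
import requests

# Hypothetical endpoint URLs for the three split services.
ENCODER_URL = "http://text-encoder.example.com/encode"
DIFFUSION_URL = "http://diffusion.example.com/denoise"
DECODER_URL = "http://decoder.example.com/decode"

def generate(prompt: str) -> dict:
    # 1) Text encoder (CPU-friendly): prompt -> text embeddings.
    context = requests.post(ENCODER_URL, json={"prompt": prompt}).json()
    # 2) Diffusion model (large GPU): embeddings -> denoised latents.
    latents = requests.post(DIFFUSION_URL, json=context).json()
    # 3) Decoder (small GPU): latents -> final image payload.
    return requests.post(DECODER_URL, json=latents).json()
```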
3. One endpoint (original, inpainting, fine-tuned) with local processing
Instead, you can keep the encoder and decoder local while deploying the diffusion model in the cloud on heavy GPUs, since that is where most of the computation happens. More generally, you can mix and match local and cloud deployments of each part of Stable Diffusion. This flexibility also lets you replace only the diffusion part with a more specialized one, such as an inpainting or fine-tuned variant, while keeping the other parts untouched.
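Here is a minimal sketch of this hybrid setup, with KerasCV's text encoder and decoder running locally and a hypothetical remote endpoint handling the denoising loop (the URL and payload shape are assumptions for illustration):

```python
import numpy as np
import requests
import keras_cv

# Hypothetical remote endpoint that runs only the diffusion model.
DIFFUSION_URL = "http://diffusion.example.com/denoise"

model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

def generate(prompt: str) -> np.ndarray:
    # Local (CPU is enough): encode the prompt into text embeddings.
    context = model.encode_text(prompt)
    # Remote (large GPU): run the expensive denoising loop in the cloud.
    resp = requests.post(
        DIFFUSION_URL, json={"context": np.asarray(context).tolist()}
    )
    latents = np.array(resp.json()["latents"], dtype="float32")
    # Local (small GPU): decode latents into images
    # (pixel-range post-processing omitted for brevity).
    return model.decoder.predict(latents)
```

Swapping in an inpainting or fine-tuned diffusion model then only requires changing what sits behind DIFFUSION_URL; the local encoder and decoder stay exactly as they are.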