Hey there. I’m working on a self-balancing robot, and on a lark I thought I’d try using reinforcement learning to train the control policy rather than hand-tuning PID loops. (I’ve done the PID loop thing before, and I’m trying to get educated on RL.)
I initially tried to use Stable-Baselines3, but I found that there is not much support for running Torch models on tflite-micro, so I decided to try tf-agents. My current state is that I have the SAC demo working fine and I’ve been looking through the code to try to get a sense of what I need to do to make training on-bot viable. For the record, I’m aware that a lot of folks train in simulation first. I’d like to try training on-device instead, since I’ve seen some evidence that SAC can train much more complex skills than what I’m looking at in a relatively short time.
What I’m looking for here is a gut-check about whether the approach I’m going to take is directionally correct. From going through the SAC Minitaur demo, I gather that I need a few things to make this whole project work.
For the record, the MCU driving the bot is an ESP32-S3, and I’ll be training on WSL2.
- I need to be able to send my actor models to the bot.
- I need the bot to observe its environment, and take an action.
- I need a reward function conditioned on the next observation. It could run either on the server or on the bot.
- The bot needs to convey the observations, the action taken, and (if computed on bot) the reward, back to the server.
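To make those last two bullets concrete, here’s roughly the per-step record I’m picturing the bot streaming back, plus a toy server-side reward for the balancing task. The names, the observation layout, and the reward shaping are all placeholders I made up, not anything from the tutorial:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class StepRecord:
    """One transition as reported by the bot (all field names are hypothetical)."""
    observation: np.ndarray       # e.g. [pitch, pitch_rate, wheel_vel_left, wheel_vel_right]
    action: np.ndarray            # motor command the actor actually emitted
    reward: Optional[float]       # filled in on-bot, or None if the server computes it
    next_observation: np.ndarray  # observation after the action was applied
    done: bool                    # e.g. the bot tipped past some angle or hit a step budget


def balance_reward(rec: StepRecord) -> float:
    """Toy reward: stay upright and don't thrash the motors."""
    pitch = rec.next_observation[0]
    return 1.0 - 0.5 * pitch ** 2 - 0.01 * float(np.sum(rec.action ** 2))
```

If the reward ends up running on the bot instead, the same record would just arrive with reward already filled in.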
My general idea here is to make a new driver similar to PyDriver. Mostly, it seems like I can reuse the code; I’ve sketched roughly what I mean after the list below. The only bits that will change are:
- At the beginning of each run(), I will compile the policy into a tflite-micro policy and send it to the bot.
- Instead of asking the policy for the action_step and the env for the next_time_step, I will listen to a stream of experiences from the bot and construct these objects from those experiences.
- When the loop finishes, I’ll tell the bot we’re finished with this training loop.
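Here’s a rough sketch of what I’m picturing for that driver, with a pile of assumptions baked in: BotDriver itself, the bot transport object, and its send_policy / stream_steps / end_run methods are all made up, and I’m assuming that PolicySaver plus the SavedModel TFLite converter is the right path from the SAC actor to a flatbuffer (I haven’t verified that conversion end to end):

```python
import tensorflow as tf
from tf_agents.policies import policy_saver
from tf_agents.trajectories import policy_step, time_step as ts, trajectory


class BotDriver:
    """PyDriver-shaped loop where the 'environment' lives on the robot."""

    def __init__(self, policy, observers, bot, max_steps=256):
        self._policy = policy          # the agent's current actor policy
        self._observers = observers    # callables that take a Trajectory (replay writer, metrics, ...)
        self._bot = bot                # hypothetical serial/Wi-Fi transport to the ESP32-S3
        self._max_steps = max_steps

    def _compile_and_send_policy(self, export_dir="/tmp/actor_policy"):
        # Export the current policy as a SavedModel, convert it, ship the flatbuffer.
        policy_saver.PolicySaver(self._policy).save(export_dir)
        converter = tf.lite.TFLiteConverter.from_saved_model(
            export_dir, signature_keys=["action"])
        self._bot.send_policy(converter.convert())   # made-up transport call

    def run(self):
        self._compile_and_send_policy()
        prev_time_step = None
        steps = 0
        # The bot streams (obs, action, reward, next_obs, done) tuples back;
        # reward could instead be computed here from next_obs, as in the sketch above.
        for obs, action, reward, next_obs, done in self._bot.stream_steps():
            current = prev_time_step if prev_time_step is not None else ts.restart(obs)
            nxt = (ts.termination(next_obs, reward) if done
                   else ts.transition(next_obs, reward))
            traj = trajectory.from_transition(
                current, policy_step.PolicyStep(action=action), nxt)
            for observer in self._observers:
                observer(traj)
            prev_time_step = nxt
            steps += 1
            if done or steps >= self._max_steps:
                break
        self._bot.end_run()   # tell the bot this training loop is over
```

The intent is that everything downstream of the observers (replay buffer, training step, checkpointing) stays exactly as it is in the tutorial; only the source of the time steps changes.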
Does that all sound basically sensible, or am I totally off-base? Can you think of any reason why this definitely won’t work? Thanks in advance for the cross-check.