TF-Agents & on-bot training -- am I barking up the right tree?

Hey there. I’m working on a self-balancing robot, and on a lark I thought I’d try using reinforcement learning to train the robot’s controller rather than hand-tuning PID loops. (I’ve done the PID thing before, and I’m trying to get educated on RL.)

I initially tried Stable-Baselines3, but there isn’t much of a path for running Torch models on tflite-micro, so I decided to try tf-agents. My current state is that I have the SAC Minitaur demo working fine, and I’ve been looking through the code to get a sense of what it would take to make training on-bot viable. For the record, I’m aware that a lot of folks train in simulation first. I’d like to try training on-device instead, since I’ve seen some evidence that SAC can learn skills much more complex than this one in a relatively short amount of real time.

What I’m looking for here is a gut-check about whether the approach I’m going to take is directionally correct. From going through the SAC Minitaur demo, I gather that I need a few things to make this whole project work.

For context, the MCU driving the bot is an ESP32-S3, and I’ll be training on WSL2.

  • I need to be able to send my actor models to the bot.
  • I need the bot to observe its environment and take an action.
  • I need a reward function conditioned on the next observation. It could run either on the server or on the bot.
  • The bot needs to convey the observations, the action taken, and (if computed on-bot) the reward back to the server -- roughly the record sketched just below.

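To make that last bullet concrete, here’s the kind of fixed-size transition record I imagine the bot streaming back, decoded on the server side. The field names, the four-float observation, and the single-float action are placeholders for whatever my bot actually measures and commands -- nothing here is a tf-agents structure, just a sketch of a possible wire format.

```python
import struct
from dataclasses import dataclass

# Hypothetical wire format: 4 float32 observations, 1 float32 action,
# 1 float32 reward, 1 uint8 done flag; little-endian, no padding.
_TRANSITION_FMT = "<4fffB"
TRANSITION_SIZE = struct.calcsize(_TRANSITION_FMT)  # 25 bytes

@dataclass
class Transition:
    observation: tuple  # e.g. (pitch, pitch_rate, wheel_vel_l, wheel_vel_r)
    action: float       # motor command the on-bot policy actually applied
    reward: float       # only meaningful if the reward is computed on-bot
    done: bool          # episode over, e.g. the bot tipped past some angle

def decode_transition(buf: bytes) -> Transition:
    """Unpack one fixed-size transition record received from the bot."""
    *obs, action, reward, done = struct.unpack(_TRANSITION_FMT, buf)
    return Transition(tuple(obs), action, reward, bool(done))
```
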
My general idea here is to make a new driver similar to PyDriver. Mostly, it seems like I can reuse the code. The only bits that will change are:

  1. At the beginning of each run(), I will convert the current policy to a tflite flatbuffer and send it to the bot (first sketch after this list).
  2. Instead of asking the policy for the action_step and the env for the next_time_step, I will listen to a stream of experiences from the bot and construct those objects from the stream (second sketch after this list).
  3. When the run finishes, I’ll tell the bot we’re done with this training loop.
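
For item 1, my rough understanding is that I’d save the current policy with PolicySaver, run the SavedModel through the TFLite converter, and push the resulting flatbuffer to the bot. The paths and the "action" signature key below are assumptions on my part -- I still need to confirm which signatures the saved policy actually exports, and whether the SAC collect policy’s sampling ops all have tflite-micro kernels.

```python
import tensorflow as tf
from tf_agents.policies import policy_saver

def export_policy_for_bot(agent, export_dir="policy_savedmodel"):
    # Save the current policy as a SavedModel (collect_policy so the bot
    # keeps exploring; agent.policy would give the greedy actor instead).
    saver = policy_saver.PolicySaver(agent.collect_policy, batch_size=1)
    saver.save(export_dir)

    # Convert the SavedModel into a .tflite flatbuffer for tflite-micro.
    # Assumption: the saved policy exposes an "action" signature.
    converter = tf.lite.TFLiteConverter.from_saved_model(
        export_dir, signature_keys=["action"])
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS]
    return converter.convert()  # bytes to send to the ESP32
```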

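For item 2, the run() replacement I have in mind looks roughly like the sketch below: rebuild the TimeStep/PolicyStep objects from the streamed transitions and hand the resulting trajectories to the usual observers (replay buffer, metrics). Here bot and its read_transition() are the hypothetical link and record from the first sketch, not tf-agents APIs, and the shapes/dtypes obviously have to match my real specs.

```python
import numpy as np
from tf_agents.trajectories import policy_step, time_step as ts, trajectory

def run_episode_from_bot(bot, observers, max_steps=500):
    # The first packet from the bot just carries the initial observation.
    first = bot.read_transition()
    cur_ts = ts.restart(np.asarray(first.observation, dtype=np.float32))

    for _ in range(max_steps):
        t = bot.read_transition()  # action taken + resulting obs/reward
        action_step = policy_step.PolicyStep(
            np.asarray([t.action], dtype=np.float32))
        obs = np.asarray(t.observation, dtype=np.float32)
        if t.done:
            next_ts = ts.termination(obs, t.reward)
        else:
            next_ts = ts.transition(obs, t.reward)

        # Same hand-off PyDriver does: one Trajectory per step to each observer.
        traj = trajectory.from_transition(cur_ts, action_step, next_ts)
        for observer in observers:
            observer(traj)

        if t.done:
            break
        cur_ts = next_ts
```
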
Does that all sound basically sensible, or am I totally off base? Can you think of any reason why this definitely won’t work? Thanks in advance for the cross-check.