Can you try pip install asitop
and sudo asitop
to see the memory bandwidth when you are training?
When fine-tuning a BERT model, my M1 Max uses around 115 GB/s of memory bandwidth. I wonder what the number is on the M1, so we can find out whether it is the bandwidth or the GPU that limits performance.
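For comparing the numbers once you have them, here is a rough sketch of how I would read them. It assumes Apple's advertised peak memory bandwidth (68.25 GB/s for the M1, 400 GB/s for the M1 Max). The function names and the 0.8 cutoff are made up for illustration and are not part of asitop. A reading close to the peak would hint at a bandwidth limit, while a low fraction would point elsewhere, for example at GPU compute.

```python
# Advertised peak unified-memory bandwidth in GB/s.
# These are Apple's published figures, assumed here rather than measured.
PEAK_GBPS = {"M1": 68.25, "M1 Max": 400.0}


def bandwidth_utilization(measured_gbps, chip):
    """Return the measured bandwidth as a fraction of the chip's advertised peak."""
    return measured_gbps / PEAK_GBPS[chip]


def likely_bandwidth_bound(measured_gbps, chip, threshold=0.8):
    """Rough heuristic: flag the run as bandwidth-bound near the peak.

    The 0.8 threshold is an arbitrary illustrative cutoff, not an established rule.
    """
    return bandwidth_utilization(measured_gbps, chip) >= threshold


if __name__ == "__main__":
    # 115 GB/s is the reading reported above for the M1 Max while fine-tuning BERT.
    frac = bandwidth_utilization(115, "M1 Max")
    print(f"M1 Max utilization: {frac:.0%}")
    print("likely bandwidth-bound:", likely_bandwidth_bound(115, "M1 Max"))
```

By this reading, 115 GB/s is well under a third of the M1 Max peak, so the same check with the M1 reading plugged in would show whether the smaller chip is running much closer to its limit.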