That tutorial is only a basic starting point. Handling longer sentences requires a more complex model. Even if you train this one on longer sentences, the accuracy will probably drop significantly.
Following @Bhack's post above, the XLSR-Wav2Vec2 model might help you.
I hope it gets published to TFHub at some point.