I have a large dataset of 25–45 second audio files with their transcriptions in a low-resource language (one that shares some vocabulary with English), and I want to fine-tune an existing model on my own data. All the tutorials I can find use Common Voice, and adapting them to my use case isn’t very straightforward. This is the tutorial I tried following; it uses torchaudio, but I would prefer TensorFlow: Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers
I was wondering if there were any references I could follow. My dataset has just two columns: a ‘filename’ column containing the audio file name (I concatenate a base path to each filename when I want to load the audio file), and a ‘sentence’ column containing the transcription. The audio is in MP3 format.
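To make the layout concrete, here is a minimal sketch of how I pair each transcription with the full path to its audio file (the `clips` directory name and the file names are just placeholders for illustration):

```python
import csv
import io
import os

# Hypothetical base directory holding the MP3 files (placeholder).
AUDIO_DIR = "clips"

# Toy example of the two-column layout described above.
raw_csv = """filename,sentence
utt_0001.mp3,first transcription
utt_0002.mp3,second transcription
"""

def load_manifest(csv_text, audio_dir):
    """Return a list of {path, sentence} dicts, joining the base
    directory onto each filename."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        {"path": os.path.join(audio_dir, row["filename"]),
         "sentence": row["sentence"]}
        for row in rows
    ]

manifest = load_manifest(raw_csv, AUDIO_DIR)
for entry in manifest:
    print(entry["path"], "->", entry["sentence"])
```

From a structure like `manifest`, I assume I would then need to decode each MP3 to a waveform and tokenize the sentences, but this is exactly the part where the Common Voice tutorials stop mapping onto my setup.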
I’m really stuck on how to proceed from here.