I have audio files (in .wav) and their corresponding captions. I'm slowly researching and building a model to transcribe the audio into text. The audio files are mostly short, averaging about 12 seconds each. But how would I do that? Is there a way to create a custom TF dataset that can take the audio in one column and its captions in another?
I'd format my dataset similarly to others with the same objective, such as LibriSpeech: librispeech | TensorFlow Datasets
That will help you train a model later, as there are already many examples based on the LibriSpeech dataset.
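If your data is already a list of .wav paths and a parallel list of captions, one way to get a two-column dataset is the `tf.data` API: pair the paths with the captions and decode the audio lazily in a `map`. A minimal sketch, assuming 16-bit PCM mono .wav files (the file names and captions below are placeholders for your own data):

```python
import tensorflow as tf

# Placeholder file list and captions -- replace with your own data.
wav_paths = ["clip_000.wav", "clip_001.wav"]
captions = ["hello world", "good morning"]

def load_example(path, caption):
    # Read the file and decode the 16-bit PCM .wav into a float32 waveform.
    audio_bytes = tf.io.read_file(path)
    waveform, sample_rate = tf.audio.decode_wav(audio_bytes, desired_channels=1)
    # Drop the channel axis: [samples, 1] -> [samples].
    return tf.squeeze(waveform, axis=-1), caption

# One "column" of audio, one of captions.
ds = tf.data.Dataset.from_tensor_slices((wav_paths, captions))
ds = ds.map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
# Pad variable-length waveforms to a common length within each batch.
ds = ds.padded_batch(2, padded_shapes=([None], []))
```

Note that `from_tensor_slices` and `map` are lazy: the files are only read when you iterate the dataset, so the pipeline stays cheap to build even for a large corpus, and `padded_batch` handles the varying clip lengths.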
@Callum_Matthews Building a custom automatic speech recognition (ASR)/speech-to-text dataset is probably quite challenging. Would it help to look at the source code of some pre-made TensorFlow Datasets, such as Librispeech or speech_commands?
- Dataset: librispeech | TensorFlow Datasets
- Source code: datasets/tensorflow_datasets/audio/librispeech.py at master · tensorflow/datasets · GitHub
- Dataset: https://www.tensorflow.org/datasets/catalog/speech_commands
- Source code: datasets/tensorflow_datasets/audio/speech_commands.py at master · tensorflow/datasets · GitHub
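Whichever dataset format you settle on, the captions still need to become integer targets before you can train an ASR model. Many LibriSpeech-based examples use character-level encoding; here is a sketch using `tf.keras.layers.StringLookup` (the vocabulary below is an assumption for illustration; in practice you would derive it from your training captions):

```python
import tensorflow as tf

# Assumed character vocabulary -- build yours from the training captions.
vocab = list("abcdefghijklmnopqrstuvwxyz' ")

# StringLookup maps each character to an integer id (id 0 is reserved for OOV),
# and the inverted layer maps ids back to characters for decoding predictions.
char_to_id = tf.keras.layers.StringLookup(vocabulary=vocab)
id_to_char = tf.keras.layers.StringLookup(vocabulary=vocab, invert=True)

def encode_caption(caption):
    # Lowercase, split into unicode characters, then look up integer ids.
    chars = tf.strings.unicode_split(tf.strings.lower(caption), "UTF-8")
    return char_to_id(chars)

labels = encode_caption(tf.constant("Hello world"))
```

You could apply `encode_caption` inside the dataset's `map` step so each element becomes a (waveform, label-ids) pair ready for a CTC-style loss.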