I have been trying to follow this guide. However, I feel like it’s severely lacking.
[Simple audio recognition: Recognizing keywords | TensorFlow Core](https://www.tensorflow.org/tutorials/audio/simple_audio)
First and foremost, why is there no import for tensorflow-io?
Under “Reading audio files and their labels”:
```python
def decode_audio(audio_binary):
  audio, _ = tf.audio.decode_wav(audio_binary)
  return tf.squeeze(audio, axis=-1)
```
Regarding this method: does it allow mp3? Is it possible to load mp3? And what does tf.squeeze actually do to the audio it decodes?
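From what I can gather, tf.audio.decode_wav only handles WAV, and mp3 support is exactly what tensorflow-io is for, which is why the missing import bothers me. Here is my best guess at an mp3 version, assuming tfio.audio.decode_mp3 returns a [samples, channels] tensor the way decode_wav does (I haven’t verified this end to end, and the file name is made up):

```python
import tensorflow as tf
import tensorflow_io as tfio  # pip install tensorflow-io

def decode_mp3(audio_binary):
  # Unlike tf.audio.decode_wav, tfio.audio.decode_mp3 returns just the
  # audio tensor, shaped [samples, channels], with no sample rate.
  audio = tfio.audio.decode_mp3(audio_binary)
  # tf.squeeze(..., axis=-1) drops the trailing channel dimension when it
  # is 1, turning [samples, 1] into a 1-D waveform of shape [samples].
  return tf.squeeze(audio, axis=-1)

audio_binary = tf.io.read_file('some_clip.mp3')  # hypothetical file
waveform = decode_mp3(audio_binary)
```

If that is right, then tf.squeeze is doing nothing mysterious in the original code either: it just collapses the mono channel axis so the rest of the pipeline works with a flat waveform.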
```python
AUTOTUNE = tf.data.AUTOTUNE
files_ds = tf.data.Dataset.from_tensor_slices(train_files)
waveform_ds = files_ds.map(get_waveform_and_label, num_parallel_calls=AUTOTUNE)
```
Regarding this piece of code, I would love to know what tf.data.AUTOTUNE does.
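As far as I can tell, tf.data.AUTOTUNE is just a sentinel constant that tells the tf.data runtime to pick the value of a tuning knob itself at runtime, instead of you hard-coding it. A minimal sketch of the difference as I understand it:

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(list(range(1000)))

# Hard-coded: exactly 4 parallel calls for the map, whatever the machine.
ds_fixed = ds.map(lambda x: x * 2, num_parallel_calls=4)

# AUTOTUNE: the runtime dynamically tunes the number of parallel calls
# (and, below, the prefetch buffer size) based on available resources.
AUTOTUNE = tf.data.AUTOTUNE
ds_tuned = ds.map(lambda x: x * 2, num_parallel_calls=AUTOTUNE).prefetch(AUTOTUNE)
```

So in the tutorial it just means “parallelize the waveform decoding, and let TensorFlow decide how many threads to use.”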
```python
def get_spectrogram(waveform):
  # Padding for files with less than 16000 samples
  zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)
  # Concatenate audio with padding so that all audio clips will be of the
  # same length
  waveform = tf.cast(waveform, tf.float32)
  equal_length = tf.concat([waveform, zero_padding], 0)
  spectrogram = tf.signal.stft(
      equal_length, frame_length=255, frame_step=128)
  spectrogram = tf.abs(spectrogram)
  return spectrogram
```
Can I create a spectrogram from mp3?
What does this line do? zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)
Why are we casting to float32? waveform = tf.cast(waveform, tf.float32)
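My best guess at what those two lines are doing, spelled out with a concrete shape (the 16000 is 1 second of audio at 16 kHz; the clip length below is my own example):

```python
import tensorflow as tf

# Say the clip is only 12000 samples long (0.75 s at 16 kHz).
waveform = tf.random.uniform([12000], minval=-1.0, maxval=1.0)

# [16000] - tf.shape(waveform) is elementwise: [16000] - [12000] = [4000],
# so this builds 4000 zeros to pad the clip up to exactly 16000 samples.
zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)

# tf.concat needs both tensors to have the same dtype, and zero_padding is
# float32, hence the cast; tf.signal.stft also expects a float input.
waveform = tf.cast(waveform, tf.float32)
equal_length = tf.concat([waveform, zero_padding], 0)
print(equal_length.shape)  # (16000,)
```

And since get_spectrogram only ever sees a waveform tensor, I assume it doesn’t care whether that waveform originally came from a wav or an mp3, as long as it is mono and 16 kHz.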
Do I need a degree in sound engineering to use this? Because it all seems like gibberish to me.
Now for the worst part: “Run inference on an audio file”.
```python
sample_file = data_dir/'no/01bb6a2a_nohash_0.wav'
sample_ds = preprocess_dataset([str(sample_file)])

for spectrogram, label in sample_ds.batch(1):
  prediction = model(spectrogram)
  plt.bar(commands, tf.nn.softmax(prediction[0]))
  plt.title(f'Predictions for "{commands[label[0]]}"')
  plt.show()
```
So let me see: I trained a model and now it’s time to use it! According to this guide, I need to create a dataset with just one entry, call sample_ds.batch(1) because again I have just one entry, and then, magic, I use the model I just created!
Shouldn’t this tutorial instead be explicit about how to correctly save the model (including its classes), and then about how to actually use it? For example, to beep every time I say a trained word into the mic, or to count the occurrences of a trained word in an mp3 file. As it is, I don’t think I could possibly use this to turn my house lights on.
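For reference, this is the best I could piece together myself for saving and reloading; the label list isn’t stored inside the model, so I save it separately. The paths and file names here are my own, and I’m reusing decode_audio and get_spectrogram from above:

```python
import json
import numpy as np
import tensorflow as tf

# After training: save the model, plus the class names next to it.
model.save('keyword_model')                        # SavedModel directory
with open('keyword_model/commands.json', 'w') as f:
  json.dump([str(c) for c in commands], f)

# Later, in a separate script: reload both and classify a single file.
model = tf.keras.models.load_model('keyword_model')
with open('keyword_model/commands.json') as f:
  commands = json.load(f)

audio_binary = tf.io.read_file('some_clip.wav')    # hypothetical clip
waveform = decode_audio(audio_binary)
spectrogram = get_spectrogram(waveform)
spectrogram = spectrogram[tf.newaxis, ..., tf.newaxis]  # add batch + channel dims
probs = tf.nn.softmax(model(spectrogram)[0])
print(commands[int(np.argmax(probs))])
```

If something like this worked in a loop over mic input, the beep-on-keyword use case would follow, but the tutorial never shows it.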
I also checked @tensorflow-models/speech-commands, but again it’s kind of useless when you can only use a pre-defined model, instead of it explaining how to convert a given .h5 model into JSON and then how to load it.
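For what it’s worth, the only conversion path I found is the tensorflowjs package (pip install tensorflowjs), either via its tensorflowjs_converter CLI or from Python; I haven’t confirmed that a model converted this way actually loads in the speech-commands package:

```python
import tensorflowjs as tfjs  # pip install tensorflowjs

# Writes model.json plus binary weight shards, the format TF.js loads.
tfjs.converters.save_keras_model(model, 'web_model')
```

Something like that belongs in the tutorial, not in the reader’s guesswork.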
Sadly, I’m really disappointed. I wanted to be able to do much more with this, and I think it has a future, but as it is it’s really hard and impractical to use. My understanding of audio itself is really limited. I also noticed that even when I train this model, the thresholds are ridiculous: if I say a word the model was never trained on, it will still classify it into one of the classes with really high confidence. I was expecting more entropy in the classification. Again, I’m no expert in this field and I was just doing this for fun.
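The only workaround I can think of for that overconfidence is a manual rejection threshold on the softmax output; this is purely my own idea, not something from the tutorial, and the 0.8 cutoff is arbitrary:

```python
import tensorflow as tf

def classify_or_reject(model, spectrogram, commands, threshold=0.8):
  # Softmax over the logits for one batched spectrogram; if no class is
  # confident enough, report "unknown" instead of forcing a label.
  probs = tf.nn.softmax(model(spectrogram)[0])
  top = int(tf.argmax(probs))
  if float(probs[top]) < threshold:
    return 'unknown'
  return commands[top]
```

Even that feels like a band-aid, since the model was never shown negative examples in the first place.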