Hi everyone,
I’m trying to implement a part of this paper: https://people.kth.se/~ghe/pubs/pdf/szekely2019casting.pdf
This part specifically:
Mel-spectrograms were extracted using the Librosa Python package with a window width of 20 ms and 2.5 ms hop length. The resulting spectrograms for two seconds of audio have 128×800 pixels. Zero crossing rates were calculated on the same windows. The neural network was implemented in Keras following the architecture in Figure 1. The first convolutional layer used 16 2D filters (size 3×3, stride 1×1) and ReLU nonlinearities, followed by batch normalisation and 5×4 max-pooling in both time and frequency. The second 2D convolutional layer used 8 filters in the frequency domain (4×1) and ReLU, followed by batch norm and 6×5 max pooling. Due to downsampling by the pooling layers, this produced 40 1×1 cells with 8 channels at a rate of 20 times per second. These were fed into a bidirectional LSTM layer of 8 hidden units in each direction, followed by a softmax output layer. The network was randomly initialised and trained for 40 epochs to minimise cross-entropy using Adadelta (with default parameters) batches of 16 two-second spectrogram excerpts. The softmax outputs can be interpreted as estimated per-frame class probabilities and used to automatically annotate the held-out episodes. Prior to further processing by either method, the temporal coherence of the automatic annotations was improved by merging mixed speech after a single-speaker segment into that speaker’s speech.
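The numbers in that paragraph can be checked with a little arithmetic. The sketch below is only one reading of it. It assumes the first pooling factor applies to frequency and the second to time (as in the code that follows), "same" padding for the convolutions, and floor division for pooling, which is the Keras default. None of this is stated in the paper, and the helper names are made up for illustration.

```python
def n_frames(duration_s, hop_ms):
    """Number of hops that fit in the excerpt (edge padding is ignored)."""
    return round(duration_s * 1000 / hop_ms)


def pool(size, window):
    """Output length of non-overlapping max pooling (floor division)."""
    return size // window


frames = n_frames(2.0, 2.5)            # 2 s of audio at a 2.5 ms hop
time_steps = pool(pool(frames, 4), 5)  # time axis: pool by 4, then by 5
freq_rows = pool(pool(128, 5), 6)      # frequency axis: pool by 5, then by 6
steps_per_second = time_steps / 2.0

print(frames, time_steps, freq_rows, steps_per_second)  # 800 40 4 20.0
```

Under these assumptions the time axis comes out at 40 steps, i.e. 20 per second as the paper says, but the frequency axis ends at 4 rather than 1, so the padding or pooling layout in the paper may differ from what is assumed here.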
This is what I have so far:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
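model = keras.Sequential(
    [
        keras.Input(shape=(128, 800, 2)),  # (mel bins, frames, channels)
        # padding="same" keeps all 800 frames, so pooling by 4 and then 5 gives 40 steps
        layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(5, 4)),
        layers.Conv2D(8, (4, 1), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(6, 5)),  # -> (4, 40, 8)
        # Assumption: the paper reports 1x1 cells here; these settings leave
        # 4 frequency rows, which are flattened into the feature axis below.
        layers.Permute((2, 1, 3)),  # time first -> (40, 4, 8)
        layers.Reshape((40, 32)),   # (time, features) sequence for the LSTM
        # return_sequences=True gives one output per time step
        layers.Bidirectional(layers.LSTM(8, return_sequences=True)),
        layers.Dense(7),  # per-step logits; softmax is applied inside the loss
    ]
)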
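model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    # Note: tf.keras Adadelta defaults to learning_rate=0.001,
    # whereas standalone Keras used lr=1.0 by default.
    optimizer=keras.optimizers.Adadelta(),
    metrics=["accuracy"],
)
# x_train: (N, 128, 800, 2) spectrogram excerpts; y_train: integer class ids
model.fit(x_train, y_train, epochs=40, batch_size=16)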
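For the last step in the paper's description (merging mixed speech that follows a single-speaker segment into that speaker's speech), here is a minimal sketch of one possible reading. The label names and the function `merge_mixed` are hypothetical, and it works on a plain list of per-frame labels, for example the argmax of the softmax outputs mapped to names.

```python
def merge_mixed(labels, mixed="mixed", speakers=("speaker_a", "speaker_b")):
    """Relabel runs of `mixed` that directly follow a single-speaker frame.

    Label names are placeholders; the paper's class set is not given here.
    """
    out = []
    for lab in labels:
        if lab == mixed and out and out[-1] in speakers:
            out.append(out[-1])  # extend the preceding speaker's segment
        else:
            out.append(lab)      # keep the label unchanged
    return out


frames_in = ["silence", "speaker_a", "speaker_a", "mixed", "mixed",
             "speaker_b", "mixed", "silence", "mixed"]
frames_out = merge_mixed(frames_in)
print(frames_out)
```

Mixed frames that follow silence (the last one in the example) are left alone, since there is no preceding single-speaker segment to merge them into.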
Can someone please help?