Hey folks,
I came across this TensorFlow tutorial: https://www.tensorflow.org/tutorials/keras/classification
It does a great job of explaining how to configure a neural network to classify the 10 labels / classes of the Fashion MNIST dataset, and it inspired me to design a neural network for music classification.
With the code below I want to feed the algorithm two folders, each containing audio files of a different music genre, create a spectrogram for each audio file, and then use those spectrogram images to train the neural network, just like in the Keras classification example above. So instead of images of 10 different fashion articles, I am using spectrogram images from two genres. The only difference is that I want my neural network to be completely linear, i.e. without the ReLU-activated dense layer in the middle. To keep things simple I started with just two folders, so for now the task is to distinguish between two musical genres, but my goal is to add more genres later.
import numpy as np
import librosa
import librosa.display
import datetime
import math
import os
import tensorflow as tf
from pathlib import Path
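# Spectrogram
def prepare_song(song_path):
    list_matrices = []
    # Load the first 10 seconds of the file, resampled to 22050 Hz
    y, sr = librosa.load(song_path, sr=22050, duration=10)
    # Power spectrogram, then mapped onto the mel scale
    D = np.abs(librosa.stft(y))**2
    S = librosa.feature.melspectrogram(S=D, sr=sr)
    list_matrices.append(S)
    return list_matrices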
audio_tracks = []
genre = []
#Genre 1
path = '/Users/Laulito/Desktop/Samplepack der Genres/House'
pathlist = Path(path).glob('**/*.wav')
for path in pathlist:
    path_in_str = str(path)
    song_pieces = prepare_song(path_in_str)
    audio_tracks += song_pieces
    genre += ([0]*len(song_pieces)) # puts zeros into target / train-labels array
#Genre 2
path2 = '/Users/Laulito/Desktop/Samplepack der Genres/Drum & Bass'
pathlist2 = Path(path2).glob('**/*.wav')
for path2 in pathlist2:
    path_in_str2 = str(path2)
    song_pieces = prepare_song(path_in_str2)
    audio_tracks += song_pieces
    genre += ([1]*len(song_pieces)) # puts ones into target / train-labels array
# Initialise
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(np.array(audio_tracks),
                                                    np.array(genre),
                                                    test_size=0.2,
                                                    train_size=0.8,
                                                    random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test,
                                                y_test,
                                                test_size=0.5,
                                                random_state=42)
# Linear Model
from keras import datasets, layers, models
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(128, 440)), # 128x440 is the size of a spectrogram-image
    tf.keras.layers.Dense(2) # Dense(2) because there are just two genres
])
model.summary()
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.2,
    decay_steps=15,
    decay_rate=0.9)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr_schedule),
              loss=tf.keras.losses.MeanSquaredError(),
              metrics=[tf.keras.metrics.Accuracy()])
model.fit(x=X_train, y=y_train, epochs=5, validation_split=0.2)
model.evaluate(x=X_test, y=y_test)
That code failed at the model.fit() line, with the terminal telling me that shape (None, 1) and shape (None, 2) are incompatible. I guess it has something to do with the last dense layer, tf.keras.layers.Dense(2), producing an output of shape (None, 2), while my label array has shape (None, 1). That surprised me, because the target in the Keras example above is also one-dimensional and its last dense layer has 10 units, so the shapes there would be (None, 10) and (None, 1) …
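A possible reason the tutorial gets away with (None, 10) outputs and one-dimensional labels: it compiles the model with tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), a loss that takes integer class ids and looks up the matching output column itself. MeanSquaredError and tf.keras.metrics.Accuracy instead compare labels and outputs element by element, so they expect both to have the same shape. Which of the two raised the error here cannot be told from the message alone. The NumPy sketch below only illustrates the shape difference. The helper names and toy values are made up for this example and are not Keras code.

```python
import numpy as np

def sparse_crossentropy_from_logits(labels, logits):
    """Cross-entropy for integer labels of shape (N,) and logits of shape (N, K)."""
    # Numerically stable log-softmax over the class axis
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class for every sample
    return -log_probs[np.arange(len(labels)), labels]

def mean_squared_error(targets, outputs):
    """Per-sample MSE; targets and outputs are compared element by element."""
    return ((targets - outputs) ** 2).mean(axis=1)

labels = np.array([0, 1, 1])                              # integer genre ids, shape (3,)
logits = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])   # model outputs, shape (3, 2)

# Integer labels work directly with the sparse cross-entropy
ce = sparse_crossentropy_from_logits(labels, logits)

# An elementwise loss needs the labels expanded to the output shape first
one_hot = np.eye(2)[labels]                               # shape (3, 2)
mse = mean_squared_error(one_hot, logits)

print(ce.shape, one_hot.shape, mse.shape)
```

In Keras terms, this would correspond to keeping the 0/1 integer labels from the first version and swapping the loss for SparseCategoricalCrossentropy(from_logits=True), as the tutorial does.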
Anyway, I modified the code as follows:
a = 0
b = 1
#Genre 1
path = '/Users/Laulito/Desktop/Samplepack der Genres/House'
pathlist = Path(path).glob('**/*.wav')
for path in pathlist:
    path_in_str = str(path)
    song_pieces = prepare_song(path_in_str)
    audio_tracks += song_pieces
    array = [a,b]
    genre += ([array]*len(song_pieces))
#Genre 2
path2 = '/Users/Laulito/Desktop/Samplepack der Genres/Drum & Bass'
pathlist2 = Path(path2).glob('**/*.wav')
for path2 in pathlist2:
    path_in_str2 = str(path2)
    song_pieces = prepare_song(path_in_str2)
    audio_tracks += song_pieces
    array = [b,a]
    genre += ([array]*len(song_pieces))
With this change the code at least runs, because the shape of genre is now (None, 2) as well, but it results in a model where the loss is “nan” and the accuracy is 0 … I might have messed something up along the way … maybe someone can help me figure out where I went wrong.
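Two guesses at what might be going on, offered as hypotheses rather than a diagnosis. First, the mel power spectrograms go into the network unscaled, and their values can span many orders of magnitude. With plain SGD at a learning rate of 0.2 on a squared-error loss, the weights can then blow up to inf and nan within a few steps. Second, tf.keras.metrics.Accuracy counts a prediction as correct only when it equals the label exactly, which raw float outputs almost never do. The NumPy sketch below shows log-scaling plus standardization (similar in spirit to librosa.power_to_db) and an argmax-based accuracy. The helper names and the random toy data are made up for illustration.

```python
import numpy as np

def log_scale_and_standardize(power_spec, amin=1e-10):
    """Convert a power spectrogram to decibels, then to zero mean and unit variance."""
    db = 10.0 * np.log10(np.maximum(power_spec, amin))
    return (db - db.mean()) / (db.std() + 1e-8)

def exact_match_accuracy(y_true, y_pred):
    """Fraction of elements where the prediction equals the label exactly."""
    return float(np.mean(y_true == y_pred))

def argmax_accuracy(y_true_one_hot, y_pred):
    """Fraction of samples whose highest-scoring output matches the label."""
    return float(np.mean(y_true_one_hot.argmax(axis=1) == y_pred.argmax(axis=1)))

rng = np.random.default_rng(0)
# Toy stand-in for one mel power spectrogram with a wide dynamic range
toy_spec = rng.uniform(1e-6, 1e4, size=(128, 440))
scaled = log_scale_and_standardize(toy_spec)

# One-hot labels and raw float outputs for three hypothetical tracks
y_true = np.array([[0, 1], [1, 0], [0, 1]])
y_pred = np.array([[0.2, 0.9], [0.7, 0.1], [0.6, 0.4]])

exact = exact_match_accuracy(y_true, y_pred)   # floats never equal 0 or 1 here
by_argmax = argmax_accuracy(y_true, y_pred)    # compares predicted class ids

print(scaled.mean(), scaled.std(), exact, by_argmax)
```

In Keras, the argmax-style comparison is what the string metric 'accuracy' used in the tutorial resolves to for classification losses. A much smaller learning rate would be another thing to try alongside the input scaling.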