Hi, I recently became interested in machine learning, more specifically, neural networks.
I have gone over a bit of introductory material on the topic and I am attempting to test my understanding of it so far. The task I am attempting to accomplish is to get a model to predict the relationships between words in its input; however, so far nothing I have tried has yielded good results.
My short-term goal is to understand how to get the model to at least predict correctly on the training and validation data, preferably with dense layers for now. I understand other layer types may be more suitable for sequential data, but since my sentences are relatively short (fewer than 15 words), I am thinking it should be possible with dense layers.
Currently, my model appears to get stuck at a relatively high loss.
Increasing or decreasing the number of layers and/or parameters doesn't seem to help much. Based on my interpretation of the histograms shown by TensorBoard, it appears that most of my layers aren't learning, as the weight distributions remain similar from epoch to epoch.
Either that, or the model learns only on the first epoch and then barely changes after that. I am guessing it might have something to do with my loss function, but I don't see the issue yet.
Any suggestions on how I can resolve that issue?
Some information about my current setup:
Input: a padded, tokenized sentence.
For now I simply collect the unique words across all sentences and map each one to an index.
I don't use any punctuation in the sentences.
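For reference, the encoding looks roughly like this (a simplified sketch; sentences, word_to_index, and padding with 0 are illustrative names rather than my exact code):

vocab = sorted({word for sentence in sentences for word in sentence.split()})
word_to_index = {word: i for i, word in enumerate(vocab)}

def encode(sentence, max_len):
    # map each word to its index and pad to a fixed length with 0
    tokens = [word_to_index[word] for word in sentence.split()]
    return tokens + [0] * (max_len - len(tokens))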
My input data is generated artificially, since for now I only care about getting the model to predict correctly on at least some of the training and validation data.
I generate sentences from templates similar to “Set an alarm {time} {day} to {task}”, “Remind me to {task} {time} {day} to {task}”, etc.
For example, {time} can be replaced with things like “noon”, “2pm”, “this afternoon”, etc.
When populating the templates, for each time and day I add a mapping to the task (essentially, the task is related to the time and day, and vice versa). Note: for things like “this afternoon”, I only map “afternoon”.
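To make the generation concrete, it works roughly like this (a simplified sketch; templates, times, days, tasks, and generate_example are illustrative names, not my exact code):

import random

templates = ["Set an alarm {time} {day} to {task}"]
times = ["noon", "2pm", "this afternoon"]
days = ["today", "tomorrow"]
tasks = ["jog", "cook"]

def generate_example():
    template = random.choice(templates)
    time, day, task = random.choice(times), random.choice(days), random.choice(tasks)
    sentence = template.format(time=time, day=day, task=task)
    # the task is related to the time and the day (and vice versa);
    # for multi-word values like "this afternoon" I only map the last word ("afternoon")
    related_word_pairs = [(time.split()[-1], task), (day.split()[-1], task)]
    return sentence, related_word_pairs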
Output: a (max_input_length, max_input_length) matrix where words (their tokens) that are related have a 1 at their intersecting indices and a 0 everywhere else.
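The target matrix for a sentence is built roughly like this (sketch; build_target and related_positions are illustrative names; the positions are the indices of two related words within the padded sentence):

import numpy as np

def build_target(related_positions, max_len):
    target = np.zeros((max_len, max_len), dtype=np.float32)
    for i, j in related_positions:
        # mark both (i, j) and (j, i) since the relationship goes both ways
        target[i, j] = 1.0
        target[j, i] = 1.0
    return target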
Model configuration: There is no particular significance to the activations I chose or to the specific number of dense layers; I am just experimenting to see their effects.
import tensorflow as tf

input_layer = tf.keras.layers.Input(shape=(max_len,))
embedding_size = 32
embedding_layer = tf.keras.layers.Embedding(len(vocab), embedding_size, name="embedding")(input_layer)
hidden_layer = tf.keras.layers.Dense(embedding_size, name="embedding_dense", activation="relu")(embedding_layer)
hidden_layer = tf.keras.layers.Dense(max_len * max_len, activation="relu", name="dense_with_activation_1")(hidden_layer)
hidden_layer = tf.keras.layers.Flatten()(hidden_layer)
output_layer = tf.keras.layers.Dense(max_len * max_len, activation="tanh", name="dense_with_activation_2")(hidden_layer)
output_layer = tf.keras.layers.Reshape((max_len, max_len))(output_layer)
model = tf.keras.models.Model(inputs=input_layer, outputs=output_layer)
model.compile(
    optimizer=tf.keras.optimizers.Adam(
        learning_rate=tf.keras.optimizers.schedules.ExponentialDecay(0.1, 100, 0.96)),
    loss=loss)
Loss function: The idea is to weight false negatives more heavily, since most entries in the target matrix are 0 (most word pairs are unrelated) and the model could otherwise do well by predicting everything as unrelated.
def loss(y_true, y_pred):
    # Count false negatives: positions where y_true is 1 but the rounded prediction is not.
    errors = tf.cast(tf.logical_and(tf.equal(y_true, 1), tf.not_equal(y_true, tf.round(y_pred))), dtype=tf.float32)
    num_1_errors = tf.reduce_sum(errors)
    # Scale the absolute error of every entry by the square root of the false-negative count.
    original_loss = tf.abs(y_true - y_pred)
    scaled_loss = tf.sqrt(num_1_errors) * original_loss
    return tf.reduce_sum(scaled_loss)