I’m developing a Machine Learning “fortune teller” that completes sentences from seeds, but my trained model performs very poorly, to the point of being unusable.
It’s my first time training with tokenized words, and I’ve tried several approaches, but none has resulted in a well-trained model.
Example of my dataset
This July money will be by your side.
The next days love will come at your door.
Be careful with your friend and money.
Love is going to be hard this year.
A lot of friends will come at your door.
Be powerful at your job it will result good this month.
... etc, etc.
I have:
- 21,885 unique sentences
- 1,523 unique words
How have I prepared my data?
I’ve sorted the unique words in alphabetical order and saved them into a 1,523-length array.
For example:
[a, at, be, by, come, door, good, job, lot, ....]
Using this array, I assign a numeric value to every word in my data.
For example:
Hello, my name is Carlos
will be encoded as [54, 504, 492, 394, 100, 150]
supposing that in my dictionary Hello=54, ","=504, my=492, name=394, is=100, Carlos=150.
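In code, my preparation looks roughly like this (a minimal sketch; the variable names are my own, and the indices only match the example above when the full 1,523-word list is used):
const sentences = [
  'This July money will be by your side .',
  'Hello , my name is Carlos',
  // ... all 21,885 sentences, pre-split so punctuation is its own token
];

// Collect the unique words and sort them alphabetically.
const wordList = [...new Set(sentences.flatMap(s => s.split(' ')))].sort();

// Map every word to its position in the sorted list.
const wordToIndex = new Map(wordList.map((word, i) => [word, i]));

// Encode a sentence as an array of word indices.
const encode = sentence => sentence.split(' ').map(w => wordToIndex.get(w));

// With the full dictionary this would give something like [54, 504, 492, 394, 100, 150].
console.log(encode('Hello , my name is Carlos'));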
Why not use a pre-trained model that completes sentences? Because my data contains special words that don’t exist in standard dictionaries, as well as unusual names.
Creating my inputs and outputs
Having tokenized my words, I’ve come up with a way to make same-length inputs for my model by grouping words into N-length sliding windows.
I’ve decided to split my sentences into 4-item input arrays with 2-item output arrays, for example:
Hello, my name is {Name} and I like apples.
will result in:
Input                             Output
["Hello", ",", "my", "name"]      ["is", "{Name}"]
[",", "my", "name", "is"]         ["{Name}", "and"]
["my", "name", "is", "{Name}"]    ["and", "I"]
["name", "is", "{Name}", "and"]   ["I", "like"]
["is", "{Name}", "and", "I"]      ["like", "apples"]
["{Name}", "and", "I", "like"]    ["apples", "."]
["and", "I", "like", "apples"]    [".", "."]
Note: once the sentence has finished I pad the output with ".", so the algorithm learns where a sentence ends.
Then I repeat these steps for all 21,885 sentences (a code sketch of this windowing follows the table below).
Finally, I substitute each word with its index in my word list, so the real input/output looks like this:
Input               Output
[400,500,390,293]   [303, 442]
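The windowing itself is simple; this is a sketch of what I do (makeWindows and dotIndex, the index of "." in my word list, are my own names):
const INPUT_LEN = 4;
const OUTPUT_LEN = 2;

// Slice one encoded sentence into 4-index inputs and 2-index outputs,
// padding past the end of the sentence with the index of ".".
function makeWindows(tokens, dotIndex) {
  const inputs = [];
  const outputs = [];
  for (let i = 0; i + INPUT_LEN < tokens.length; i++) {
    inputs.push(tokens.slice(i, i + INPUT_LEN));
    const out = tokens.slice(i + INPUT_LEN, i + INPUT_LEN + OUTPUT_LEN);
    while (out.length < OUTPUT_LEN) out.push(dotIndex); // pad when the sentence ends
    outputs.push(out);
  }
  return {inputs, outputs};
}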
Training
My data in this example consists of 4-length input arrays and 2-length output arrays.
The idea is to train a model that, given 4 words, can predict the next 2 words. (I used a 4-length input for this example, but I’ve tried multiple input lengths.)
Layers and model used so far
const model = tf.sequential();
model.add(tf.layers.dense({units: 100, inputShape: [4]}));
model.add(tf.layers.activation({activation: 'softmax'}));
model.add(tf.layers.dense({units: 2}));
model.compile({loss: 'categoricalCrossentropy', optimizer: tf.train.sgd(0.001), metrics: ['accuracy']});

// 70% of the data used for training
const xs = tf.tensor2d(inputs, [inputs.length, inputs[0].length]);
const ys = tf.tensor2d(outputs, [outputs.length, outputs[0].length]);

// 20% of the data used for validation
const xsVal = tf.tensor2d(inputsVal, [inputsVal.length, inputsVal[0].length]);
const ysVal = tf.tensor2d(outputsVal, [outputsVal.length, outputsVal[0].length]);

model.fit(xs, ys, {epochs: 100, batchSize: 64, validationData: [xsVal, ysVal]}).then(async () => {
  const saveResult = await model.save('file://modelo2');
});
But I cannot get it to train correctly; it gets stuck, giving the same loss value on every step.
I’ve also tried changing the learning rate, but I keep getting the same loop, which makes me think I’m not preparing my data correctly or I’m using the wrong technique.
I’ve also tried changing the loss to meanSquaredError, but the problem is the same.
Second approach
Instead of encoding the output as the token index from my word list (e.g. Hello=54), I’ve tried another approach: predicting only the next word, encoded as a 1,523-length one-hot array of [0,0,0,...,1,0] with a 1 at the index of the word in my word list.
For example, to encode the output word website from the list ["car", "apple", "website", "orange", "red"], the output will be the array [0, 0, 1, 0, 0].
So now my data looks like:
Input               Output
[400,500,390,293]   [0,0,0,0,0,0,0,0,0,0,1]
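In tf.js the one-hot vectors can be built with tf.oneHot; here is a minimal sketch (the depth of 5 is just from the small example, for the real data it would be 1,523):
const tf = require('@tensorflow/tfjs-node');

// Turn the next-word index into a one-hot vector over the vocabulary.
const wordIndex = 2; // 'website' in ['car', 'apple', 'website', 'orange', 'red']
const oneHot = tf.oneHot(tf.tensor1d([wordIndex], 'int32'), 5);
oneHot.print(); // [[0, 0, 1, 0, 0]]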
But this training also fails.
What am I missing or doing wrong?
I think I need to build something like an LSTM, but I’ve never trained one, and I’m lost about how to prepare the data for the problem I need to solve and how to set up the layers and model.
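For reference, this is roughly the kind of setup I imagine, but I haven’t verified that it trains, and the layer sizes are guesses on my part:
// Rough sketch: embed the 4 input token indices, run them through an
// LSTM, and predict a softmax distribution over the 1,523-word
// vocabulary for the next word (one-hot targets, as in my second approach).
const VOCAB_SIZE = 1523;
const model = tf.sequential();
model.add(tf.layers.embedding({inputDim: VOCAB_SIZE, outputDim: 64, inputLength: 4}));
model.add(tf.layers.lstm({units: 128}));
model.add(tf.layers.dense({units: VOCAB_SIZE, activation: 'softmax'}));
model.compile({loss: 'categoricalCrossentropy', optimizer: 'adam', metrics: ['accuracy']});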
Can anyone suggest a better approach for this?
Thanks in advance.
Note: All data and tokens for this example have been made up to explain the problem.