I’m trying to create a text generation model that learns at the character level. I’ve managed to tokenize the text and build a tf.data dataset similar to the one in this tutorial - Text generation with an RNN | Text | TensorFlow
However, instead of vectorizing with something like ids_from_chars, I’ve used raw Unicode code points. A sample from the dataset, obtained by the following piece of code -
for input_example, target_example in unicode_encoded_dataset.take(1):
    print("Input :", input_example.numpy())
    print("Target:", target_example.numpy())
looks like this -
Input : [112 114 101 102 97 99 101 32 32 32 115 117 112 112 111 115 105 110
103 32 116 104 97 116 32 116 114 117 116 104 32 105 115 32 97 32
119 111 109 97 110 45 45 119 104 97 116 32 116 104 101 110 63 32
105 115 32 116 104 101 114 101 32 110 111 116 32 103 114 111 117 110
100 32 102 111 114 32 115 117 115 112 101 99 116 105 110 103 32 116
104 97 116 32 97 108 108 32 112 104]
Target: [114 101 102 97 99 101 32 32 32 115 117 112 112 111 115 105 110 103
32 116 104 97 116 32 116 114 117 116 104 32 105 115 32 97 32 119
111 109 97 110 45 45 119 104 97 116 32 116 104 101 110 63 32 105
115 32 116 104 101 114 101 32 110 111 116 32 103 114 111 117 110 100
32 102 111 114 32 115 117 115 112 101 99 116 105 110 103 32 116 104
97 116 32 97 108 108 32 112 104 105]
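For context, this is roughly how I decoded the text into code points and built the input/target pairs, following the RNN tutorial (a sketch, not my exact code - the file path is just a placeholder):

import tensorflow as tf

# Decode the whole corpus into a 1-D tensor of Unicode code points.
text = open("corpus.txt", "rb").read().decode("utf-8")  # placeholder path
all_ids = tf.strings.unicode_decode(text, input_encoding="UTF-8")

# Chunk the stream into 101-point sequences, then shift by one character
# to get (input, target) pairs of length 100 each.
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)
sequences = ids_dataset.batch(100 + 1, drop_remainder=True)

def split_input_target(sequence):
    return sequence[:-1], sequence[1:]

unicode_encoded_dataset = sequences.map(split_input_target)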
I wish to create an LSTM model similar to the one present here - Character-level text generation with LSTM
However, being new to tf.data datasets, I’m not able to figure out how to get the right shapes; I keep getting errors which I think are due to the batch sizes.
After batching and shuffling, my dataset looks like this -
<PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int32, name=None), TensorSpec(shape=(64, 100), dtype=tf.int32, name=None))>
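The batching step was roughly this (again a sketch; the shuffle buffer size is a placeholder):

BATCH_SIZE = 64
unicode_encoded_dataset = (
    unicode_encoded_dataset
    .shuffle(10000)
    .batch(BATCH_SIZE, drop_remainder=True)  # gives the (64, 100) spec above
    .prefetch(tf.data.AUTOTUNE)
)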
I wish to teach the model to generate the most probable next Unicode code point at each timestep, limited by the max code point in my dataset (236), so it essentially predicts a class between 0 and 236. My model:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential(
    [
        keras.Input(shape=(64, 100)),
        layers.LSTM(128),
        layers.Dense(237, activation="softmax"),
    ]
)
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss="categorical_crossentropy", optimizer=optimizer)
epochs = 40
for epoch in range(epochs):
    model.fit(unicode_encoded_dataset, epochs=1)
    print()
    print("Generating text after epoch: %d" % epoch)
I get -
ValueError: Input 0 of layer "sequential_4" is incompatible with the layer: expected shape=(None, 64, 100), found shape=(64, 100)
Can someone help me figure out how to fix this issue?
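From reading the Keras docs, my guess is that keras.Input should take the per-example shape (without the batch dimension), that I need an Embedding layer so the LSTM receives 3-D input, and that the LSTM needs return_sequences=True to predict at every timestep. Something like the sketch below (the embedding size of 64 is an arbitrary guess), but I’m not sure this is right:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential(
    [
        # Per-example shape only; the batch dimension (64) comes from the dataset.
        keras.Input(shape=(100,), dtype="int32"),
        # Map each code point id (0..236) to a dense vector so the LSTM
        # gets input of shape (batch, 100, 64).
        layers.Embedding(input_dim=237, output_dim=64),
        # return_sequences=True gives one prediction per timestep.
        layers.LSTM(128, return_sequences=True),
        layers.Dense(237, activation="softmax"),
    ]
)
# My targets are integer code points (not one-hot vectors), so I assume I
# also need sparse_categorical_crossentropy instead of categorical_crossentropy.
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer=keras.optimizers.RMSprop(learning_rate=0.01),
)

Is this the right way to set up the shapes?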