NLP translator data preparation

Nerd_Corner · May 2, 2023, 1:46am

I am new to this forum and unsure if anyone is willing to help, but I want to give it a try.

My goal is a translator from German to English and vice versa. I have a large dataset of 152,000 samples (words and sentences).

I want to use a sequence length of 60. My input vocabSize is 30944 and my output vocabSize is 14849.

If I tokenize and pad my sequences, I get something like [[11,0,44,2322,23111,…],[444,22,11113,4456,…]]. It would be easy to create an input tensor with tf.tensor2d(paddedInputs); the shape would be [152000,60].
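For illustration, the padding step could look roughly like the sketch below. It is plain JavaScript with no TensorFlow.js dependency; `padSequences` is a hypothetical helper, and the pad id 0 and the post-padding (zeros appended at the end) are assumptions, not necessarily what the original code does.

```javascript
// Hypothetical helper: pad (or truncate) every token-id sequence to maxLen.
// Assumes id 0 is reserved for padding and that padding is appended at the end.
function padSequences(sequences, maxLen, padId = 0) {
  return sequences.map((seq) => {
    const trimmed = seq.slice(0, maxLen); // cut sequences that are too long
    const padding = new Array(maxLen - trimmed.length).fill(padId);
    return trimmed.concat(padding);
  });
}

// Two toy sequences of different lengths
const tokenized = [[11, 44, 2322], [444, 22, 11113, 4456]];
const paddedInputs = padSequences(tokenized, 60);

// Every row now has exactly 60 entries, so the array is rectangular
const shape = [paddedInputs.length, paddedInputs[0].length];
console.log(shape);
```

Because every row has the same length, an array like this is what tf.tensor2d(paddedInputs) expects; for the full dataset the first dimension would be 152000 instead of 2.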

But I read that an NLP translator model needs a 3D tensor of shape [batchSize, sequenceLength, vocabSize].

Two questions:

1. Is it really true that the tensor has to be 3D instead of 2D? Why?
2. How do I create the 3D tensor from my 2D paddedInputs?
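As a rough sketch of what the 3D shape means: a [batchSize, sequenceLength, vocabSize] tensor is usually the one-hot encoding of the 2D id array, where each token id becomes a vector of length vocabSize with a single 1. Whether this is needed depends on the model, which the thread does not show. A model that starts with an embedding layer typically takes the 2D integer ids directly, and one-hot encoding is more often applied to the targets when training with categorical cross-entropy. The helper `oneHot3d` below is hypothetical and uses plain arrays with a tiny vocabulary; in TensorFlow.js, tf.oneHot applied to an int32 tensor of ids does the equivalent.

```javascript
// Hypothetical helper: turn a 2D array of token ids into a 3D one-hot array.
// Output shape: [numSequences, sequenceLength, vocabSize].
function oneHot3d(paddedIds, vocabSize) {
  return paddedIds.map((seq) =>
    seq.map((id) => {
      const row = new Array(vocabSize).fill(0);
      row[id] = 1; // mark the position of this token id
      return row;
    })
  );
}

// Toy example: 2 sequences of length 3, vocabulary of 4 tokens
const tinyPadded = [[1, 0, 3], [2, 2, 0]];
const tinyOneHot = oneHot3d(tinyPadded, 4);

const oneHotShape = [
  tinyOneHot.length,
  tinyOneHot[0].length,
  tinyOneHot[0][0].length,
];
console.log(oneHotShape);
```

Note that one-hot encoding the whole dataset at once would be far too large: 152000 × 60 × 14849 is roughly 1.35 × 10^11 values. If one-hot targets are needed at all, they would have to be built one batch at a time.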

Nerd_Corner · June 17, 2023, 8:06am

Okay, I think I managed to solve these issues myself, and I documented all my steps:
Part 1
Part 2

But I still can’t figure out why my model works in TensorFlow but not in TensorFlow.js. Can anyone explain? I also included the source code in the link to Part 2.
