Suppose I have a very simple piece of Python code like this:
from tqdm import tqdm

# `file`, `token_start`, `token_end` and `gpt_input` are defined earlier.
corpus = file.read()
file_contents = corpus.split()[token_start:token_end]   # list of words, in order

input_tokens, output_tokens = [], []
for i in tqdm(range(len(file_contents) - gpt_input - 1)):
    input_tokens.append(file_contents[i : i + gpt_input])   # gpt_input consecutive words
    output_tokens.append(file_contents[i + gpt_input])      # the word that follows them

X = [' '.join(window) for window in tqdm(input_tokens)]     # join each window back into a string
Y = output_tokens
The code does three things:
- Load the file into RAM and split its contents into words, i.e. we get a list of the file's words in their original order.
- Build two lists, input_tokens and output_tokens: for every i from 0 up to (roughly) total_tokens - 1, append the window of gpt_input consecutive words starting at position i to input_tokens, and the (i + gpt_input)-th word to output_tokens.
- Reconstruct strings from input_tokens, i.e. join each window of gpt_input words back into a short sentence (a standalone sketch of this is shown right after the example below).
Example:
If the file has contents like this:
Hello World, I'm writing a new cool code in TensorFlow, please don't forget to check it!
The end result:
input_tokens for gpt_input = 3:
Hello World, I'm
World, I'm writing
I'm writing a
writing a new
a new cool
...
output_tokens for gpt_input = 3:
writing
a
new
cool
code
...
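For reference, here is the same windowing as a small self-contained sketch that reproduces the example above (the function name make_windows is just for illustration; it mirrors the code at the top):

```python
def make_windows(words, gpt_input):
    """Slide a window of `gpt_input` words over `words`; each window becomes
    one input string and the word right after the window is the target."""
    X, Y = [], []
    for i in range(len(words) - gpt_input):
        X.append(' '.join(words[i:i + gpt_input]))  # gpt_input consecutive words
        Y.append(words[i + gpt_input])              # the next word
    return X, Y

words = "Hello World, I'm writing a new cool code in TensorFlow".split()
X, Y = make_windows(words, gpt_input=3)
print(X[0], '->', Y[0])   # Hello World, I'm -> writing
```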
So, now the problem: the text corpus needed to train a GPT model can be very large, up to 200-300 GB, and cannot be loaded into RAM directly. TensorFlow offers the tf.data API, with a set of tools for loading, caching and training from very large datasets. But I don't see any way in the documentation to create and pre-process a text corpus with tf.data in this windowed fashion; to me it looks pretty much impossible. If there is any way to load corpus fragments with a window size defined in words, kindly let me know.
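Is something along these lines on the right track? This is only a sketch of what I imagine, using tf.data.TextLineDataset together with Dataset.window and flat_map; "corpus.txt" is a placeholder file name, and I am not sure these are the intended tools for word-level windows:

```python
import tensorflow as tf

gpt_input = 3

# Stream the file line by line instead of reading it all into RAM.
lines = tf.data.TextLineDataset("corpus.txt")   # placeholder file name

# Split every line into words and flatten into one long stream of word tokens.
tokens = lines.flat_map(
    lambda line: tf.data.Dataset.from_tensor_slices(tf.strings.split(line)))

# Sliding window of gpt_input + 1 words, moving one word at a time:
# the first gpt_input words are the input, the last word is the target.
windows = tokens.window(gpt_input + 1, shift=1, drop_remainder=True)
windows = windows.flat_map(lambda w: w.batch(gpt_input + 1))

# Turn each window into (input words joined back into a string, next word).
pairs = windows.map(
    lambda w: (tf.strings.reduce_join(w[:gpt_input], separator=' '), w[gpt_input]))

for x, y in pairs.take(3):
    print(x.numpy().decode(), '->', y.numpy().decode())
```

Even if something like this is functionally correct, I don't know whether it behaves well on a 200-300 GB corpus, which is the case I actually care about.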
Thank you in advance.