Suppose I have a very simple piece of Python code like this:
from tqdm import tqdm

# `file`, `token_start`, `token_end` and `gpt_input` are defined earlier.
corpus = file.read()
file_contents = corpus.split()[token_start:token_end]   # list of words, in order

input_tokens, output_tokens = [], []
for i in tqdm(range(len(file_contents) - gpt_input - 1)):
    input_tokens.append(file_contents[i : i + gpt_input])   # gpt_input consecutive words
    output_tokens.append(file_contents[i + gpt_input])      # the word that follows them

X = [' '.join(window) for window in tqdm(input_tokens)]     # join each window back into a string
Y = output_tokens
The code does three things:
- Load the file into RAM and split its contents into words, i.e. we get a list of the file's words in their original order.
- Build two lists, input_tokens and output_tokens: for every i from 0 up to (roughly) total_tokens - 1, append the window of gpt_input consecutive words starting at position i to input_tokens, and the (i + gpt_input)-th word to output_tokens.
- Reconstruct strings from input_tokens, i.e. join each window of gpt_input words back into a short sentence (a standalone sketch of this is shown right after the example below).
Example:
If the file has contents like this:
Hello World, I'm writing a new cool code in TensorFlow, please don't forget to check it!
The end result:
input_tokens for gpt_input = 3:
Hello World, I'm
World, I'm writing
I'm writing a
writing a new
a new cool
...
output_tokens for gpt_input = 3:
writing
a
new
cool
code
...
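For reference, here is the same windowing as a small self-contained sketch that reproduces the example above (the function name make_windows is just for illustration; it mirrors the code at the top):

```python
def make_windows(words, gpt_input):
    """Slide a window of `gpt_input` words over `words`; each window becomes
    one input string and the word right after the window is the target."""
    X, Y = [], []
    for i in range(len(words) - gpt_input):
        X.append(' '.join(words[i:i + gpt_input]))  # gpt_input consecutive words
        Y.append(words[i + gpt_input])              # the next word
    return X, Y

words = "Hello World, I'm writing a new cool code in TensorFlow".split()
X, Y = make_windows(words, gpt_input=3)
print(X[0], '->', Y[0])   # Hello World, I'm -> writing
```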
So, now the problem: the text corpus needed to train a GPT model can be very large, up to 200-300 GB, and cannot be loaded into RAM directly. TensorFlow offers the tf.data API, with a set of tools for loading, caching and training from very large datasets. But I don't see any way in the documentation to create and pre-process a text corpus with tf.data in this windowed fashion; to me it looks pretty much impossible. If there is any way to load corpus fragments with a window size defined in words, kindly let me know.
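Is something along these lines on the right track? This is only a sketch of what I imagine, using tf.data.TextLineDataset together with Dataset.window and flat_map; "corpus.txt" is a placeholder file name, and I am not sure these are the intended tools for word-level windows:

```python
import tensorflow as tf

gpt_input = 3

# Stream the file line by line instead of reading it all into RAM.
lines = tf.data.TextLineDataset("corpus.txt")   # placeholder file name

# Split every line into words and flatten into one long stream of word tokens.
tokens = lines.flat_map(
    lambda line: tf.data.Dataset.from_tensor_slices(tf.strings.split(line)))

# Sliding window of gpt_input + 1 words, moving one word at a time:
# the first gpt_input words are the input, the last word is the target.
windows = tokens.window(gpt_input + 1, shift=1, drop_remainder=True)
windows = windows.flat_map(lambda w: w.batch(gpt_input + 1))

# Turn each window into (input words joined back into a string, next word).
pairs = windows.map(
    lambda w: (tf.strings.reduce_join(w[:gpt_input], separator=' '), w[gpt_input]))

for x, y in pairs.take(3):
    print(x.numpy().decode(), '->', y.numpy().decode())
```

Even if something like this is functionally correct, I don't know whether it behaves well on a 200-300 GB corpus, which is the case I actually care about.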
Thank you in advance.