Hello y’all
I’m trying to build a text summarization model by fine-tuning GPT-2. Here is my current code:
import tensorflow as tf
import keras_nlp
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")
documents= ["Hello, World", "Hello, World", "Hello, World"]
summaries = ["Good Morning", "Good Evening", "Good Day"]
# 1. Tokenize each string in the two lists
# No padding or truncation here, since OpenAI didn't apply them either
documents_toknized = list(map(tokenizer, documents))
summaries_toknized = list(map(tokenizer, summaries))
# 2. Convert those lists into TensorFlow tensors
documents_tensor = ### I'm stuck here
summaries_tensor = ### I'm stuck here
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss)
model.fit(x=documents_tensor, y=summaries_tensor, epochs=1)
As described in this tutorial doc, I’m trying to convert documents_toknized and summaries_toknized to TensorFlow tensors to use as x and y in model.fit().
What I don’t know is what exactly I should pass as x and y in model.fit().
I know I can convert each one of them like this:
input_ids = [item['input_ids'] for item in documents_toknized]
attention_mask = [item['attention_mask'] for item in documents_toknized]
input_ids_tensor = tf.convert_to_tensor(input_ids)
attention_mask_tensor = tf.convert_to_tensor(attention_mask)
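One caveat on the conversion above: tf.convert_to_tensor needs every sequence to have the same length, which only happens to hold for the toy strings here. Below is a minimal sketch of padding to a common length, written in plain Python so the idea is visible without TensorFlow. The helper name pad_batch is hypothetical, the token ids are made up, and reusing GPT-2's end-of-text id (50256) as the pad id is an assumption, since GPT-2 ships without a dedicated pad token.

```python
from typing import Dict, List

PAD_ID = 50256  # assumption: reuse GPT-2's end-of-text id as the padding id


def pad_batch(sequences: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Right-pad token id lists to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should not attend to
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids of different lengths, standing in for real tokenizer output
batch = pad_batch([[10, 11, 12], [20, 21], [30]])
```

The resulting rectangular lists can then go through tf.convert_to_tensor exactly as in the snippet above.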
But I’m wondering whether I should pass just the input_ids_tensor for each list as x and y, or do it some other way.
Also, how can I apply batching here?
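For reference, here is a rough sketch of one setup that is commonly used with decoder-only models like GPT-2, offered as an assumption rather than a confirmed answer: join each document and its summary into a single sequence, use that sequence as both input and labels, and set the document part of the labels to -100 so the loss skips it. The helper names build_example and iter_batches are hypothetical, the token ids are made up, and using the end-of-text id (50256) as separator is an assumption. The sketch also assumes the Hugging Face TF models accept labels inside the input dictionary and shift them internally for next-token prediction.

```python
from typing import Dict, Iterator, List

SEP_ID = 50256       # assumption: GPT-2's end-of-text id as separator and terminator
IGNORE_INDEX = -100  # label value that the transformers loss skips


def build_example(doc_ids: List[int], sum_ids: List[int]) -> Dict[str, List[int]]:
    """Join document and summary into one sequence; only summary tokens get real labels."""
    input_ids = doc_ids + [SEP_ID] + sum_ids + [SEP_ID]
    # Mask the document and the separator so only the summary contributes to the loss
    labels = [IGNORE_INDEX] * (len(doc_ids) + 1) + sum_ids + [SEP_ID]
    return {"input_ids": input_ids, "labels": labels}


def iter_batches(examples: List[Dict[str, List[int]]], batch_size: int) -> Iterator[List[Dict[str, List[int]]]]:
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


# Made-up token ids standing in for tokenized documents and summaries
example = build_example([1, 2, 3], [7, 8])
examples = [example, build_example([4, 5], [9]), build_example([6], [7])]
batches = list(iter_batches(examples, batch_size=2))
```

In TensorFlow itself, the batching step corresponds to tf.data.Dataset.from_tensor_slices(...).batch(batch_size), once the examples have been padded to a common length.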
Can anyone help me with this? Thanks