Fine-tuning GPT2 for text summarization

Hello y’all

I’m trying to create a text summarization ML model by fine-tuning GPT2. Here is my current code:

import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

documents = ["Hello, World", "Hello, World", "Hello, World"]
summaries = ["Good Morning", "Good Evening", "Good Day"]

# 1. Tokenize each string in those two lists.
# Here we skip preprocessing such as padding and truncation, as OpenAI didn't use them.
documents_tokenized = list(map(tokenizer, documents))
summaries_tokenized = list(map(tokenizer, summaries))

# 2. Convert those lists into TensorFlow tensor format.
documents_tensor = ### I'm stuck here
summaries_tensor = ### I'm stuck here

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss)
model.fit(x=documents_tensor, y=summaries_tensor, epochs=1)

As described in the tutorial doc, I’m trying to convert documents_tokenized and summaries_tokenized to TensorFlow tensors to use as x and y in model.fit().

What I’m not sure about is exactly what to put into x and y in model.fit().

I know I can convert each one of them like this:

input_ids = [item['input_ids'] for item in documents_tokenized]
attention_mask = [item['attention_mask'] for item in documents_tokenized]
input_ids_tensor = tf.convert_to_tensor(input_ids)
attention_mask_tensor = tf.convert_to_tensor(attention_mask)

But I’m wondering whether I should just pass input_ids_tensor as x and y, or do it some other way.
Plus, how can I apply batching here?

Can anyone help me with this? Thanks

Hi @Seungjun_Lee,

Sorry for the delay in response.

Here, you need to pass both input_ids (the tokenized representation of your documents) and attention_mask (which tells the model which tokens are real and which are padding) as the inputs (denoted as x). The target (denoted as y) is the labels, which are the tokenized representation of the summaries.
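For example, something along these lines (a minimal sketch; note that GPT-2's tokenizer has no padding token by default, so you need to set one before padding a batch):

tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

# Tokenize with padding so all sequences in a list share the same length;
# return_tensors="tf" returns TensorFlow tensors directly.
doc_enc = tokenizer(documents, padding=True, return_tensors="tf")
sum_enc = tokenizer(summaries, padding=True, return_tensors="tf")

# x: a dict carrying both input_ids and attention_mask
x = {"input_ids": doc_enc["input_ids"],
     "attention_mask": doc_enc["attention_mask"]}

# y: the tokenized summaries serve as the labels
y = sum_enc["input_ids"]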

To apply batching, use tf.data.Dataset. You can create a dataset from your tensors (input_ids_tensor, attention_mask_tensor, labels_tensor), then use .batch(batch_size) for batching and .shuffle(buffer_size) for shuffling. This lets you train the model efficiently on batches of data and pass the dataset directly to model.fit(). Kindly refer to the tf.data.Dataset documentation for more information.
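For instance, a minimal sketch building on the tensors from the snippet above (buffer_size and batch_size here are arbitrary placeholder values):

# Pair each {input_ids, attention_mask} dict with its label tensor.
dataset = tf.data.Dataset.from_tensor_slices((
    {"input_ids": doc_enc["input_ids"],
     "attention_mask": doc_enc["attention_mask"]},
    sum_enc["input_ids"],
))

# Shuffle the examples, then group them into batches.
dataset = dataset.shuffle(buffer_size=100).batch(2)

# Keras accepts a tf.data.Dataset directly; no separate x and y needed.
model.fit(dataset, epochs=1)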

Hope this helps. Thank you.