Fine-tuning GPT2 for text summarization

Hello y’all

I’m trying to create a text summarization ML model by fine-tuning GPT2. Here is my current code:

import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

documents = ["Hello, World", "Hello, World", "Hello, World"]
summaries = ["Good Morning", "Good Evening", "Good Day"]

# 1. Tokenize each string in those two lists.
# Here we skip preprocessing such as padding and truncation, as OpenAI didn't use them.
documents_tokenized = list(map(tokenizer, documents))
summaries_tokenized = list(map(tokenizer, summaries))

# 2. Convert those lists into TensorFlow tensor format.
documents_tensor = ### I'm stuck here
summaries_tensor = ### I'm stuck here

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss)
model.fit(x=documents_tensor, y=summaries_tensor, epochs=1)

As described in the tutorial doc, I’m trying to convert documents_tokenized and summaries_tokenized to TensorFlow tensors to use as x and y in model.fit().

What I’m not sure about is exactly what to put into x and y in model.fit().

I know I can convert each one of them like this:

input_ids = [item['input_ids'] for item in documents_tokenized]
attention_mask = [item['attention_mask'] for item in documents_tokenized]
input_ids_tensor = tf.convert_to_tensor(input_ids)
attention_mask_tensor = tf.convert_to_tensor(attention_mask)

But I’m wondering whether I should just pass input_ids_tensor as x and y, or do it some other way.
Plus, how can I apply batching here?

Can anyone help me with this? Thanks

Hi @Seungjun_Lee,

Sorry for the delay in response.

Here, you need to pass both input_ids (the tokenized representation of your documents) and attention_mask (which tells the model which tokens are real and which are padding) as the inputs (denoted as x). The target (denoted as y) is the labels, which are the tokenized representation of the summaries.
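For example, something along these lines (a minimal sketch; note that GPT-2's tokenizer has no padding token by default, so you need to set one before padding a batch):

tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

# Tokenize with padding so all sequences in a list share the same length;
# return_tensors="tf" returns TensorFlow tensors directly.
doc_enc = tokenizer(documents, padding=True, return_tensors="tf")
sum_enc = tokenizer(summaries, padding=True, return_tensors="tf")

# x: a dict carrying both input_ids and attention_mask
x = {"input_ids": doc_enc["input_ids"],
     "attention_mask": doc_enc["attention_mask"]}

# y: the tokenized summaries serve as the labels
y = sum_enc["input_ids"]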

To apply batching, use tf.data.Dataset. You can create a dataset from your tensors (input_ids_tensor, attention_mask_tensor, labels_tensor), then use .batch(batch_size) for batching and .shuffle(buffer_size) for shuffling. This lets you train the model efficiently on batches of data and pass the dataset directly to model.fit(). Kindly refer to the tf.data.Dataset documentation for more information.
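For instance, a minimal sketch building on the tensors from the snippet above (buffer_size and batch_size here are arbitrary placeholder values):

# Pair each {input_ids, attention_mask} dict with its label tensor.
dataset = tf.data.Dataset.from_tensor_slices((
    {"input_ids": doc_enc["input_ids"],
     "attention_mask": doc_enc["attention_mask"]},
    sum_enc["input_ids"],
))

# Shuffle the examples, then group them into batches.
dataset = dataset.shuffle(buffer_size=100).batch(2)

# Keras accepts a tf.data.Dataset directly; no separate x and y needed.
model.fit(dataset, epochs=1)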

Hope this helps. Thank you.