Updating a tokenizer with another corpus

Talha_Ruzgar_Akkus · April 28, 2023, 2:27am

I am working on an MLM model, so I am training both a tokenizer and a model. My data is large (an 80 GB corpus), so I am training the model part by part, and the tokenizer too. I have trained a tokenizer on part 1 of the data, but I can't update it with part 2 of the data.
I have tried a lot of code, but nothing has worked.
What should I do?
Thank you in advance.

LK_Kadali · September 27, 2024, 9:49am

Hi @Talha_Ruzgar_Akkus,

Sorry for the delayed response. In your case, fine-tuning the model on the Part 2 data might be helpful. First train the tokenizer on the first part of the data and save both the tokenizer and the model checkpoints.

Then load the checkpoints saved from the Part 1 run:

from transformers import BertForMaskedLM
model = BertForMaskedLM.from_pretrained("checkpoints", config=config)

Fine-tune the model from these checkpoints on the new data. Please refer to the code examples (Ref).
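The steps above could be sketched roughly as follows. This is only a minimal illustration, not a definitive recipe: it assumes the Part 1 tokenizer and model were both saved into the same "checkpoints" directory, that Part 2 is stored as plain-text files with one example per line, and that the transformers and datasets packages are installed. The function names, the file pattern, and the hyperparameters are hypothetical. The Part 1 tokenizer is reused unchanged, because the model's embedding matrix is indexed by that vocabulary.

```python
from pathlib import Path
from typing import List


def find_part_files(data_dir: str, pattern: str) -> List[str]:
    """Return the sorted files under data_dir that match pattern (hypothetical layout)."""
    return sorted(str(p) for p in Path(data_dir).glob(pattern))


def continue_mlm_training(checkpoint_dir: str, text_files: List[str], output_dir: str) -> None:
    """Resume masked-language-model training from a saved checkpoint on new text files."""
    # Heavy dependencies are imported here so the helper above stays stdlib-only.
    from datasets import load_dataset
    from transformers import (
        AutoTokenizer,
        BertForMaskedLM,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    # Assumption: the tokenizer was saved next to the model after Part 1.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
    model = BertForMaskedLM.from_pretrained(checkpoint_dir)

    # Load the new text files, one example per line.
    dataset = load_dataset("text", data_files={"train": text_files}, split="train")

    def tokenize(batch):
        # max_length is an illustrative value, not taken from the thread.
        return tokenizer(batch["text"], truncation=True, max_length=128)

    tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

    # Randomly masks tokens in each batch to build the MLM objective.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=16,
    )
    trainer = Trainer(
        model=model, args=args, train_dataset=tokenized, data_collator=collator
    )
    trainer.train()

    # Save model and tokenizer together so the next part can resume from here.
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
```

A call might look like continue_mlm_training("checkpoints", find_part_files("data", "part2_*.txt"), "checkpoints_part2"), where the directory names and the file pattern are placeholders for your own layout.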

Thank you
