Updating a tokenizer with another corpus

I am working on an MLM model, so I am training both a tokenizer and a model. My data is big (an 80 GB corpus), so I am training the model part by part, and the tokenizer too. I trained a tokenizer on part 1 of the data, but I can't update it with part 2 of the data.
I have tried a lot of code, but nothing has worked.
What should I do?
Thanks in advance.

Hi @Talha_Ruzgar_Akkus ,

Sorry for the delayed response. In your case, fine-tuning the model on the part 2 data might help. First train the tokenizer and the model on the first part of the data, save both the tokenizer and the model checkpoints, and then:

  • Load the checkpoints saved from training on the PART1 data
model = BertForMaskedLM.from_pretrained("checkpoints", config=config)

Then fine-tune the model, starting from those checkpoints, on the new data. Please refer to the code examples Ref.
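To make the resume step concrete, here is a minimal sketch of continuing MLM training from a saved checkpoint. It is only an illustration under stated assumptions: the tiny `BertConfig`, the temporary `checkpoint_dir`, the `MASK_ID` value, the `mask_tokens` helper and the random `part2_batches` are all hypothetical stand-ins so that the snippet runs offline. In a real run the directory would hold your part 1 checkpoint and the batches would come from part 2 encoded with the part 1 tokenizer. The masking here is also simplified (every selected token becomes `[MASK]`); `DataCollatorForLanguageModeling` from `transformers` would normally handle that.

```python
import tempfile

import torch
from transformers import BertConfig, BertForMaskedLM

torch.manual_seed(0)

# Stand-in for the part 1 run: a tiny, randomly initialised model saved to disk.
# In practice this directory would already contain the real part 1 checkpoint.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
checkpoint_dir = tempfile.mkdtemp()  # hypothetical checkpoint location
BertForMaskedLM(config).save_pretrained(checkpoint_dir)

# Resume: load the saved weights. The vocabulary size stored in the checkpoint
# has to match the tokenizer that was trained on part 1.
model = BertForMaskedLM.from_pretrained(checkpoint_dir)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

MASK_ID = 4  # assumed id of the [MASK] token in the part 1 vocabulary


def mask_tokens(input_ids, mask_prob=0.15):
    """Simplified MLM masking: selected tokens become MASK_ID, labels are -100 elsewhere."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob
    masked[:, 0] = True  # make sure every sequence has at least one masked position
    labels[~masked] = -100  # positions with -100 are ignored by the loss
    corrupted = input_ids.clone()
    corrupted[masked] = MASK_ID
    return corrupted, labels


# Fake "part 2" data: 2 batches of 8 already-tokenized sequences of length 16.
part2_batches = [torch.randint(5, 100, (8, 16)) for _ in range(2)]

losses = []
for batch in part2_batches:
    input_ids, labels = mask_tokens(batch)
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    losses.append(loss.item())

# Save again so that a later part can resume from this point.
model.save_pretrained(checkpoint_dir)
```

The same load, train, save cycle can then be repeated for each further part of the corpus, always reusing the tokenizer from part 1 so the token ids stay consistent with the saved embeddings.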

Thank you