Transformer model for language understanding with another Dataset

PK2021 · October 3, 2021, 1:53pm

Hello all!

I have been reading the official guide (Transformer model for language understanding | Text | TensorFlow) to try to recreate the vanilla Transformer in TensorFlow. I noticed that the dataset used is quite specific, and at the end the guide suggests trying a different dataset.

But that is where I have been stuck for a long time! I am trying to use the WMT14 dataset (as used in the original paper, Vaswani et al.): wmt14_translate | TensorFlow Datasets.

I have also tried the Multi30k and IWSLT datasets from spaCy, but are there any guides on how to fit a dataset to what the model requires, specifically how to tokenize it? The official TF guide uses a pretrained tokenizer, which is specific to the PT-EN dataset it uses:

model_name = "ted_hrlr_translate_pt_en_converter"

I am wondering how I can use the TF (BERT) tokenizer to tokenize the spaCy dataset. I have the code for PyTorch, but unfortunately I do not know how to adapt it for TensorFlow. Any help would be greatly appreciated!

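    # Legacy torchtext API: this import works for torchtext < 0.9;
    # in 0.9-0.11 the same modules live under torchtext.legacy.
    import spacy
    from torchtext import data, datasets

    # The 'de' / 'en' shortcuts need spaCy 2.x; spaCy 3 requires full
    # model names such as 'de_core_news_sm' and 'en_core_web_sm'.
    spacy_de = spacy.load('de')
    spacy_en = spacy.load('en')

    def tokenize_de(text):
        # Split German text into a list of token strings.
        return [tok.text for tok in spacy_de.tokenizer(text)]

    def tokenize_en(text):
        # Split English text into a list of token strings.
        return [tok.text for tok in spacy_en.tokenizer(text)]

    BOS_WORD = '<s>'
    EOS_WORD = '</s>'
    BLANK_WORD = '<blank>'
    # Source side: padding only. Target side: also start/end markers.
    SRC = data.Field(tokenize=tokenize_de, pad_token=BLANK_WORD)
    TGT = data.Field(tokenize=tokenize_en, init_token=BOS_WORD,
                     eos_token=EOS_WORD, pad_token=BLANK_WORD)

    # Keep only pairs where both sides have at most MAX_LEN tokens.
    MAX_LEN = 100
    train, val, test = datasets.IWSLT.splits(
        exts=('.de', '.en'), fields=(SRC, TGT),
        filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and
            len(vars(x)['trg']) <= MAX_LEN)

    # Build vocabularies from tokens that occur at least MIN_FREQ times.
    MIN_FREQ = 2
    SRC.build_vocab(train.src, min_freq=MIN_FREQ)
    TGT.build_vocab(train.trg, min_freq=MIN_FREQ)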

chunduriv · September 16, 2022, 2:33pm

Please follow the Subword tokenizers | Text | TensorFlow guide. Thank you.
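
If you would rather port the PyTorch snippet from the question than train a subword vocabulary, it may help to see what those torchtext Fields do in plain Python first. The following is only a minimal, framework-free sketch, not the torchtext implementation and not what the TensorFlow guide does. Assumptions: whitespace splitting stands in for the spaCy tokenizers, an extra `<unk>` token covers words below `MIN_FREQ`, the special tokens are placed first so that the padding ID is 0, and every function name (`tokenize`, `build_vocab`, `encode`, `pad`, `prepare`) is hypothetical.

```python
from collections import Counter

# Special tokens as in the question; UNK_WORD is an added assumption.
BOS_WORD, EOS_WORD, BLANK_WORD, UNK_WORD = "<s>", "</s>", "<blank>", "<unk>"
MAX_LEN = 100
MIN_FREQ = 2


def tokenize(text):
    # Assumption: whitespace splitting stands in for the spaCy tokenizers.
    return text.split()


def build_vocab(token_lists, min_freq=MIN_FREQ):
    # Special tokens come first (padding gets ID 0), followed by every
    # token seen at least min_freq times, in sorted order.
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = [BLANK_WORD, UNK_WORD, BOS_WORD, EOS_WORD]
    vocab += sorted(tok for tok, n in counts.items() if n >= min_freq)
    return {tok: i for i, tok in enumerate(vocab)}


def encode(tokens, vocab, add_bos_eos):
    # Map tokens to IDs; out-of-vocabulary tokens map to the <unk> ID.
    ids = [vocab.get(tok, vocab[UNK_WORD]) for tok in tokens]
    if add_bos_eos:
        ids = [vocab[BOS_WORD]] + ids + [vocab[EOS_WORD]]
    return ids


def pad(batch, pad_id):
    # Right-pad every sequence to the length of the longest one.
    width = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (width - len(ids)) for ids in batch]


def prepare(pairs):
    # Tokenize, drop pairs where either side exceeds MAX_LEN tokens,
    # build one vocabulary per language, then encode and pad.
    tokenized = [(tokenize(src), tokenize(tgt)) for src, tgt in pairs]
    tokenized = [(s, t) for s, t in tokenized
                 if len(s) <= MAX_LEN and len(t) <= MAX_LEN]
    src_vocab = build_vocab(s for s, _ in tokenized)
    tgt_vocab = build_vocab(t for _, t in tokenized)
    src_ids = pad([encode(s, src_vocab, False) for s, _ in tokenized],
                  src_vocab[BLANK_WORD])
    tgt_ids = pad([encode(t, tgt_vocab, True) for _, t in tokenized],
                  tgt_vocab[BLANK_WORD])
    return src_ids, tgt_ids, src_vocab, tgt_vocab


# Toy German-English pairs, made up purely for illustration.
example_pairs = [
    ("ein Hund läuft", "a dog runs"),
    ("ein Hund schläft", "a dog sleeps"),
]
example_src, example_tgt, _, _ = prepare(example_pairs)
print(example_src)
print(example_tgt)
```

The two padded ID lists are rectangular, so they could then be handed to something like `tf.data.Dataset.from_tensor_slices` and batched for the model. A word-level vocabulary like this grows quickly on a corpus the size of WMT14, which is one reason the subword approach in the linked guide is probably the better fit there.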
