Hi, so recently I have been researching BERT (ALBERT specifically) and its related works, and while working with them I have a few questions (which I have tried to answer myself, but I probably have knowledge gaps).
-
How is the preprocessing done for BERT and ALBERT alike?
So far I have been able to preprocess the text using albert_en_preprocess and the SentencePiece tokenizer, but it's like a genie in a bottle that I don't really understand: I call a function and boom, it's done. I skimmed through ALBERT's paper 1909.11942.pdf (arxiv.org) and still didn't find an explanation. It works, but I don't get it.
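To make it concrete, here is roughly what I am running. This is just a minimal sketch; the hub handle and version number are simply the ones I happened to pick, so treat them as assumptions:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Preprocessing model for ALBERT (handle/version is just what I used; adjust as needed).
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/albert_en_preprocess/3")

# Raw strings go in; the layer tokenizes with SentencePiece under the hood
# and returns fixed-length integer tensors ready for the encoder.
sentences = tf.constant([
    "ALBERT is a lite BERT.",
    "I still don't get the preprocessing.",
])
encoder_inputs = preprocessor(sentences)

# Three tensors, each of shape (batch_size, seq_length):
#   input_word_ids - SentencePiece token ids, padded/truncated to seq_length
#   input_mask     - 1 where there is a real token, 0 where it is padding
#   input_type_ids - segment ids (all 0 here, since there is only one segment)
for name, tensor in encoder_inputs.items():
    print(name, tensor.shape)
```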
Yes, I did try looking at SentencePiece's source code for a second, but even its code structure went over my head at Mach 5.
-
Output vectors
The ALBERT output contains 2 vectors: pooled_output and sequence_output. The pooled_output is the sentence embedding with shape 1x768, and the sequence_output is the token-level embedding with shape 1x(token_length)x768.
This is pretty clear about what is what, but I couldn't find a reason for the fixed x768 part; it's probably my lack of research at this point.
Other than that I have no problems working with the models. It would be awesome if someone with more experience could tell me the details of why; I am pretty sure the x768 answer will be a short "OH! I SEE THAT" moment.
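For reference, this is how I am reading the two outputs. Again a minimal sketch, and the encoder handle/version is just the one I happened to use:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Preprocessing + encoder pair (handles are just examples; swap for your own).
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/albert_en_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/albert_en_base/3", trainable=False)

sentences = tf.constant(["ALBERT is a lite BERT."])
outputs = encoder(preprocessor(sentences))

# pooled_output: one vector per input sentence -> shape (batch_size, 768)
print(outputs["pooled_output"].shape)
# sequence_output: one vector per token position -> shape (batch_size, seq_length, 768)
print(outputs["sequence_output"].shape)
```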
Thanks