As I understand it, during inference, when I input “The quick brown fox,” the model predicts the next word after “The,” then after “The quick,” and so on. Why does it predict tokens that are already in the input? Why doesn’t it start predicting directly after the entire input, i.e. after “The quick brown fox”? If the model predicts a word like “tree” after “The quick brown,” do we continue with “The quick brown tree”? If not, why do we spend computational resources on these predictions?
Hey, last I checked Google doesn’t have a completions model, which seems to be what you are looking for. If you ask “What is 2+2?”, you don’t want the response to be “What is 2+2? 2+2=4”; you just want the answer. The predictions over words already in the input are how the model processes the prompt, not extra output.
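To add to this: the intermediate predictions aren’t wasted work. A causal transformer processes the whole prompt in one parallel forward pass, so a next-token prediction falls out at *every* position for free; generation only reads the prediction at the last position and never overwrites the prompt. Here is a minimal sketch with a toy stand-in “model” (random logits, hypothetical shapes) just to show the mechanics, assuming a vocabulary of 10 token ids:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 10  # toy vocabulary size (assumption for illustration)

def forward(tokens):
    """Toy causal 'model': one logit vector per input position.
    A real transformer likewise emits a prediction at every position
    of the prefix in a single parallel pass."""
    return rng.normal(size=(len(tokens), VOCAB))

prompt = [1, 2, 3, 4]          # stand-in for "The quick brown fox"
generated = list(prompt)
for _ in range(3):
    logits = forward(generated)            # shape (seq_len, VOCAB)
    next_token = int(logits[-1].argmax())  # only the LAST row is used
    generated.append(next_token)           # earlier rows are simply ignored

# The prompt is never replaced by the model's intermediate guesses:
print(generated[:len(prompt)] == prompt)  # prints True
```

So if the model “predicts tree after The quick brown,” that row of logits is discarded during inference; it only matters at training time, where every position’s prediction contributes to the loss.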