I have very specific data, unique to the countries and regions I'm processing, and I made a mistake by including some empty strings when generating my JSONL files for batch processing. I'm also using structured responses with a JSON schema per request.
As an example:
JSONL line one might be something like
Prompt: “Do XYZ on this data {data inserted here}”
Schema: {…my data schema}
JSONL line two, with the empty string, might be
Prompt: "Do XYZ on this data {""}"
Schema: {…my data schema}
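Concretely, I believe each batch line ends up looking roughly like this (heavily simplified: the custom_id values, model field, and overall request shape are my assumptions about the standard Batch API format, the schema is elided, and line two's content is just the prompt with nothing interpolated):

```jsonl
{"custom_id": "line-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "...", "messages": [{"role": "user", "content": "Do XYZ on this data {data inserted here}"}], "response_format": {"type": "json_schema", "json_schema": "...my data schema"}}}
{"custom_id": "line-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "...", "messages": [{"role": "user", "content": "Do XYZ on this data "}], "response_format": {"type": "json_schema", "json_schema": "...my data schema"}}}
```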
In the responses I get back, the key for line one has the data you would expect given the input, while the key for line two (the empty string) contains a hallucination that looks like completely legitimate data for that specific country.
For instance, I'm working with data from Brazil, so all the names of people in the data are common in Brazil. If I pass in an empty string, I would never have expected it to hallucinate names and even legitimate locations in Brazil, since there should be no context about Brazil when I literally passed in nothing about it. The only way this makes sense to me is if the bulk data is being processed in a manner where the individual requests share memory.
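For now, the only guard I can think of on my end is to validate each record and skip blanks when generating the file. A minimal sketch of the idea in Python (assuming the standard Batch API request shape; write_batch_file, MODEL_NAME, and the custom_id format are placeholders, not my real pipeline):

```python
import json

def write_batch_file(records, path):
    # Skip records that are empty or whitespace-only so no "blank" request
    # ever reaches the batch; return the indices I skipped so I can audit them.
    skipped = []
    with open(path, "w", encoding="utf-8") as f:
        for i, data in enumerate(records):
            if not (data or "").strip():
                skipped.append(i)
                continue
            line = {
                "custom_id": f"line-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "MODEL_NAME",  # placeholder
                    "messages": [
                        {"role": "user", "content": f"Do XYZ on this data {data}"}
                    ],
                    # response_format with my JSON schema would go here
                },
            }
            f.write(json.dumps(line, ensure_ascii=False) + "\n")
    return skipped
```

Returning the skipped indices lets me re-source or drop those records explicitly instead of silently losing them.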
But that only prevents bad inputs going forward. Is there any other way I can mitigate this?