Hello all, I have no idea where exactly to leave feedback about the new experimental models, but I thought I’d leave it here in case anyone is interested.
I have a prompt for transcript summarisation which I’ve used with every Gemini Flash, Pro & exp model to date - approximately 2,000 summaries created with the same prompt via the API.
My output is a text summary, based on a Whisper transcript with timestamps, using a basic XML structure. Unfortunately, Gemini-exp-1206 is the first model that does not follow the structure reliably - in fact it fails reliably with the same issue: it will open an XML tag as <topicts> and close it with </topicsummary> instead of </topicts>, mismatching the tags. All previous Gemini models could follow this prompt. I’ve tried adding further examples of valid and invalid output to the prompt, and it still makes the same errors.
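In case it helps anyone hitting the same issue, here’s a minimal sketch (Python, function name is my own) of the kind of well-formedness check you can run before accepting a response - wrapping the fragment in a dummy root element and letting the standard XML parser reject mismatched tags:

```python
import xml.etree.ElementTree as ET

def is_well_formed(model_output: str) -> bool:
    """Return True if the model's XML fragment parses cleanly.

    The fragment is wrapped in a dummy root so output with multiple
    top-level elements still parses; a mismatched close tag raises
    ParseError and we reject the response.
    """
    try:
        ET.fromstring(f"<root>{model_output}</root>")
        return True
    except ET.ParseError:
        return False
```

One caveat: a bare `&` or `<` in the summary text will also fail this check even when the tags match, so you may want to distinguish those cases before deciding to re-run.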
Additionally, the model frequently hallucinates timestamps - around half the time - which is problematic: I’ll end up re-running a summary several times and still have to manually repair the output. Again, I’ve tried giving more specific instructions and examples in the prompt to no avail - it continues to be unreliable, and I can no longer trust the output (I could before).
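For anyone with a similar pipeline, a rough sketch of the check I’d use to flag hallucinated timestamps automatically rather than by eye - the regex assumes MM:SS / HH:MM:SS-style stamps (adjust it to whatever your Whisper transcript actually emits), and matching against the transcript is deliberately strict:

```python
import re

# Assumed timestamp format: MM:SS or HH:MM:SS (adjust for your transcript).
TS_PATTERN = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")

def hallucinated_timestamps(summary: str, transcript: str) -> list[str]:
    """Return timestamps in the summary that never occur in the transcript."""
    known = set(TS_PATTERN.findall(transcript))
    return [ts for ts in TS_PATTERN.findall(summary) if ts not in known]
```

If any come back, I can reject the summary and retry instead of discovering the bad stamps during manual review.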
Prior to 6th Dec, I was using the original Gemini-1.5-pro-001 model for shorter transcripts because it was particularly verbose and seemed to convey the personality/humour of the original transcript really well - this continues to work well, but only applies to around 10% of my transcripts. For the longer transcripts (the vast majority), I would use Gemini-1.5-exp-0827, which has sadly been withdrawn without a suitable replacement.
I’m about to start exploring the latest Mistral models to see if they can do a better job for my use case; unfortunately, the Gemini models are no longer suitable.
I did wonder whether Gemini-2.0-flash-thinking-exp-1219 might perform better, but unfortunately it only returns 500 (internal error), or 429 (rate limit) if I keep retrying. I’ve not yet had a successful response from this model via the API.
Any suggestions / techniques / other users’ experiences would be great - my experience just doesn’t match the headlines I keep hearing about these new models, so I’m reluctantly looking elsewhere.