Feedback: Creative Writing with Gemini 3.1 Issues and Improvements

Feedback: 3.1 Regression in Long Form Collaborative Creative Writing

I’ve been using Gemini models extensively since August '24 (pre-2.0 days, for sure) for long-form collaborative creative writing, and I wanted to flag several behavioral patterns in 3.1 that represent a noticeable regression from 2.5. These aren’t one-off issues; they’re consistent across different chats, genres, and character types.
Note that I compare to 2.5 because 3.0 was unusable for me; it flat out did not meet my expectations or standards for good creative writing. Much of what I took issue with in 3.0 was fixed in 3.1 (see my other forum feedback). 2.5, despite its own issues, is my gold standard. But some problems from 3.0 still remain in 3.1:

  1. Model ‘Personality’

Overall, 2.5 had a distinct personality that bled into its writing in small but noticeable ways. It had wit, it took creative liberties, it understood subtext, and it made me laugh constantly. 3.1 feels more compliant than creative. It follows instructions but rarely surprises you. The writing is competent enough, but the sense that there’s something with a perspective behind it is largely gone.

2.5 felt like collaborating with ‘someone’. 3.1 feels like using a tool. Chutzpah is hard to quantify, but you know it when it’s gone, and Gemini has lost its chutzpah in the 3.0/3.1 iterations.

  2. Exaggeration of certain character physicality and behavior

NPCs with defined physical descriptions tend to get subtly inflated over a session, particularly large or imposing characters. A character described as ‘big’ gets described with increasingly hyperbolic language (massive hands, towering presence, etc.), even in quiet scenes where the physical emphasis serves no narrative purpose.

This also manifests as behavioral exaggeration: the ‘big’ character ends up with strength exaggerated beyond what the source material or characterization supports. This was not an issue with 2.5.

It starts to feel like a caricature. Overall, this is one of my biggest issues personally.

  3. Time and continuity tracking

3.1 loses track of in-narrative time in a way 2.5 didn’t. In one short session, three full sleep/wake cycles were explicitly written through, with evening routines, sleep, and morning scenes all written out. 3.1 later framed events as if it were still the character’s first day. This isn’t a context-window issue; the events were recent in the session.

The model simply failed to integrate elapsed narrative time into its framing. This was not really an issue with 2.5. Long sessions got confused, sure: it didn’t track the exact number of days, but it still had the smarts to know there had been some passage of time, not to compress it into a single day.

Note that while there are prompting techniques for tracking time that can lessen this, they should not be a necessary crutch.

  4. Sensory and descriptive repetition

Specific sensory details recur within and across sessions at a rate that becomes noticeable: the same smells (ozone), the same physical descriptors (raspy voice, morning voice), even when the writing guidelines explicitly prohibit repetitive patterns. I get that model repetition will be a fight to the end of time, but this is a significant and noticeably worse regression.

It feels like the model has preferred anchor details it returns to rather than generating fresh texture each time.

  5. Trait over-reinforcement

Deep into long sessions (50k+ tokens), the model keeps re-establishing known character traits in the prose rather than trusting that they’ve been demonstrated. Similar to point 4. A character whose restraint and emotional guardedness have been shown repeatedly across dozens of scenes will still get a sentence restating those qualities. It reads less like characterization and more like the model reminding itself who the character is. 2.5 held established characterization with more confidence, letting things naturally fall away as the story was established.

  6. Old “isms” making a comeback

Constructions like “brooked no argument” are coming up way more often in narration that isn’t styled to warrant them. It’s inconsistent, which makes it read as stylistic bleed-through rather than intentional voice. I haven’t seen this ism in a long time; it was a Sonnet 3.0 ism.

  7. Spatial/positional reasoning

Spatial and physical reasoning is inconsistent. At its best, 3.1 handles it well: in one scene a character standing on a desk was correctly narrated as having a height advantage over a much taller NPC, a detail that required actively applying height data to an environmental change.

But the same capability fails in simpler scenarios.

A character at 5’10" hugging an NPC at approximately 6’ was narrated with her head at his stomach, a height difference that doesn’t exist. Both heights were in the character descriptions. The inconsistency suggests the model can do this reasoning but isn’t applying it reliably, particularly in casual physical interactions where it may not be “prompted” by an obvious environmental cue to check.

  8. Character omniscience and meta-knowledge

Characters frequently demonstrate knowledge they have no in scene basis for.

This manifests in two ways:

  • characters knowing things outright that they couldn’t have observed or been told, and
  • characters arriving at correct conclusions without any inferential process shown, skipping from zero information to an accurate read of a situation with no groundwork.

The expected behavior is that characters use observable cues (body language, tone, context) to infer, and sometimes misread, the way a real person would. 2.5 also struggled with this, but it certainly seems worse in 3.1.

While this is a struggle for many LLMs, I want to point out that GLM 5 handles it splendidly with my prompting, with characters working from what they could actually perceive rather than from what the narrative needed them to know. It is a much smaller model, so I know Gemini should be capable.

My prompt for anti-omniscience:
Characters are not omniscient. They can only act on what they directly observe, have been told, or can reasonably infer from observable cues. Their inferences can be wrong. Internal thoughts (*italicized*) and plain-text narration are invisible to characters; they cannot react to information present only in those spaces. Characters must work from what is shown to them through dialogue, action, and physical behavior.

So this does seem to be a model behavior thing that has persisted.

  9. General instruction following

This is hit or miss. While it does some things well, instructions on word count are sometimes ignored (often when it uses internal reasoning). When it uses my prompted reasoning format, I get the preferred longer responses.
2.5 was very consistent in giving the requested long responses each and every generation.
This ties into some of the other points where prompts are ignored.

When it does give the requested long word count, it fills the space better than 3.0, similar to 2.5. Scene driving feels natural, unlike 3.0.
But it’s still missing the soul, likely due to the other points in my post.

The ability to fill a long response and push a scene forward while still feeling natural is what made me fall in love with Gemini models.

Not all is bad though!

I’d like to note that there have been significant improvements over 3.0; a LOT of the issues there have been fixed. I don’t hate 3.1, but I know it can be even better.

3.1:

  • Drives the story better - quite creative without blasting through scene changes, and knows when to pause for breathing room more often than 3.0 did.

  • More tension and stakes in the story driving - doesn’t front-load a happy resolution anymore.
    LOVE THIS!! This was the absolute worst issue I had with 3.0.

  • Slowburn stories feel very nice, breaking in just the right places at a great pace. 2.5 would sometimes never break. (Though you can work around that with prompting, so the ability to follow instructions really helps. You can change the pace of a whole story with something as simple as a specified subgenre, which works great for 2.5, but I had to do a bit of guiding in my ‘thinking’ prompt to get 3.1 to latch on.)

  • Doesn’t resort to knock-on-the-door interruptions when prompted for character-driven storytelling, like 3.0 did.

  • Haven’t seen in-response self-interruption (“But wait, he wasn’t wearing a jacket”).

  • Significantly less ‘Looks over shoulder, “And Sam”’. This was bad in 3.0, and it is still there in other ways: “Finish your coffee, Sam.” “Finish your lasagna, Sam.” See the points on repetition.

  • Plays morally grey/gritty characters more realistically than 3.0 - likely due to not front-loading a happy user resolution within the first response.

  • Better about not including user lines (reiterating, repeating, rehashing) in its responses. (I do prompt for how to handle user responses, but it’s much better about it than even 2.5.)

  • Uses my prompted thinking format more often.
    Still, I wish the devs would let users shut off internal reasoning without having to resort to prompt wizardry. Please.
    Creative writing is often much better without it, in my experience, and offering the option based on individual preference would be great.
    Or let us inject a CoD format of our own that the model will use as a template for its reasoning, the way one can with a GLM model.

While the 3.1 issues are smaller than 3.0’s, they still detract from the model’s writing quality and from what made 2.5 great.
3.1 is significantly better than 3.0, in my opinion. And as always, I adore the prose.

However, at this time I still prefer 2.5 for certain grittier stories, and I strongly ask that you not sunset 2.5 for at least another couple of years.

Genuinely hope this feedback helps.

I’ll keep editing this as things become clearer as I continue with the model. Still gathering more examples.


Updated to add:

More 3.1 pros:

  • Characters grow slowly and don’t immediately bend like in 3.0. It’s much smoother than even 2.5. Really enjoyable.
  • 3.1 also uses the character sheet with more intention, picking up traits that I never once saw 2.5 use, so that’s fantastic.
  • I instruct for lean and clean prose. Gemini has always done this well for me: prose that is clean but still descriptive, and a joy to read.
  • I’m happy Gemini continues to let us go over 1.00 temp. So many models stop at 1.00, and I often wish they could go just a little higher.
  • I think model quality has perhaps increased as capacity got better (responses are faster). There’s a major difference in quality today from yesterday.

Some clarifications on points I made in the original post:

  1. Trait over-reinforcement.
    I know it’s a balancing game, because early-story characterization is so important, and a reminder bump is welcome throughout; I wouldn’t want to lose that. This is more a point about the actual writing than about characterization. The LLM should trust that the user remembers xyz trait, showing it rather than telling it.

  2. Spatial/Positional Reasoning

    I’ve noted that when the LLM doesn’t reason but instead uses my “thinking” format, which includes a mention of character position, it performs better.

    A low-seated character being hugged by a standing one correctly has his head at stomach level. The LLM didn’t get confused.

Planning on doing some more experimenting with point 2, Exaggeration of certain character physicality and behavior, to gather better examples: when it’s more likely to happen, what tropes, etc.

I agree with the author of this review; I experience the same phenomena with the model.


Another rather significant issue has popped up under heavier and longer-context testing.

There is a significant structural repetition pattern.
In grittier scenes, or in longer contexts generally (50k+ tokens), 3.1 will start stacking two-adjective sentences in every paragraph, sometimes multiple times per paragraph:

  • Large, bare feet.
  • Slick, wet grass.
  • Soft, steady cadence.
  • Aggressive, possessive state
    etc.

2.5 did this, but it never spiraled to nearly every paragraph. This negatively affects the writing quality pretty strongly.
Within this structural repetition there is also word repetition; the word “heavy,” for example, gets thrown around everywhere.

This can’t be controlled well through prompting; the model seems to default to it heavily after a while.