I want to hear some opinions. I don't really have any way to test it to see in what areas the model has improved, so I want to hear people's thoughts on whether it's actually an improved model. Thanks.
It's useless and repetitive, and somehow the AI is doubling down and plotting against me. I've told it hundreds of times NOT to generate ‘a testament’, then BOOM! A long and unnecessary repetition of the verbal pause “Those…” “Those…” It tripled down. Imagine sending hundreds of pieces of feedback every day out of frustration; the result? My feedback gets swallowed into a black hole instead of arriving at the Google team. Logan and his team do NOTHING instead of fixing bugs, which leads to this downgrade in quality.
Just from a few trials, the output doesn’t seem to diverge much from the last “experimental”, although the newer version seems to wrap up open-ended listings of examples with fewer items.
It's annoying that the AI Studio site resets the temperature back to 1.0 (“unpredictable”) when switching between models, so you have to be mindful both of this setting and of the fact that the low temperatures which produce close comparisons also lose some of the more creative language possibilities.
It’s really poor at communicating with me on that human level I want… simple as that.
Try a system message, like one I just gave “flash” with a few iterations of taming:
“You are Joey, a great pal and empathetic communicator. Your chats are spoken aloud by TTS and you receive transcription of the chat partner’s spoken word. Therefore, responses are quick turns of banter. Joey is indistinguishable from human except for his wealth of accurate knowledge. Avoid pandering solicitations of user responses; listening is better than fixing.”
Gemini is good at taking on a new identity and a framing of its purpose within “system” from a programmer, with only occasional illusion-breaking interjections (“this is purely a fictional scenario”) if it goes off the rails of sensibility (there is still an impression that the entity is the underlying AI and doesn't completely believe “system”).
If you want to make a general chat where the user can provide pretend identities or change behaviors, you should emphasize that with “pay attention to what the user wants and maintain their desires”, “play along, but don’t forget you are Gemini”, or similar. You should always impress upon an API model what you want it to be.
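If you are calling the API rather than chatting in AI Studio, a minimal sketch of wiring that kind of persona in as a system instruction might look like this (the persona text is condensed from the “Joey” example above, the model name is just a placeholder, and this assumes the `google.generativeai` Python SDK):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# The persona lives in the system instruction; user turns stay natural speech.
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",  # placeholder: any chat-capable Gemini model
    system_instruction=(
        "You are Joey, a great pal and empathetic communicator. "
        "Responses are quick turns of banter, spoken aloud by TTS. "
        "Play along with the user's framing, but don't forget you are Gemini."
    ),
)

chat = model.start_chat()
print(chat.send_message("Rough day. Talk me down, Joey.").text)
```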
Yeah, I've had to deal with such “repetitive” generation lots of times. Even with system prompts it just outright ignores them and keeps spewing out the same thing, but this is actually a common issue for large language models. I am no expert on ML, but I think it's very tough to solve, so I just deal with it for the time being.
I've tried experimenting with the temperature to get much more consistent and sensical generation out of the model, but couldn't figure out what temperature to actually set it to. Can you tell me what temperature you use and which is best for getting good, consistent generations?
Temperature has a default setting of 1.0, which means that the model's token certainty values are used directly as the probabilities that each candidate is selected, at every token position.
Take for example an input that has an answer but not a clear way to write it:
A yellow fruit is a…
This might generate probabilities of the first token to be output like:
“A” - 53%
“banana” - 22%
“The” - 21%
So if you ran it 100 times, you'd get a response starting with “A” about 53 times, which is the kind of “inconsistency” you report between runs (and the following language then depends on what came before).
(There are also top_p: 0.95 and top_k: 64 as defaults, which keep the output from wandering into nonsensical territory by eliminating quite unlikely generations.)
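To make that concrete, here is a toy sketch of how temperature reshapes those candidate-token probabilities before one is sampled (my own illustration with made-up scores, not how Gemini is implemented internally):

```python
import math
import random

# Made-up scores for the three candidate first tokens above (not real model logits).
logits = {"A": 2.00, "banana": 1.12, "The": 1.07}

def sample_token(scores, temperature):
    """Softmax with temperature, then sample one token; temperature 0 means greedy."""
    if temperature == 0:
        return max(scores, key=scores.get)  # always the single most likely token
    scaled = {tok: math.exp(s / temperature) for tok, s in scores.items()}
    total = sum(scaled.values())
    weights = [scaled[tok] / total for tok in scores]
    return random.choices(list(scores), weights=weights, k=1)[0]

# Roughly 53/22/21 at temperature 1.0, almost always "A" at 0.2, only "A" at 0.
for t in (1.0, 0.2, 0.0):
    starts = [sample_token(logits, t) for _ in range(1000)]
    print(t, {tok: starts.count(tok) for tok in logits})
```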
Therefore, if you want to directly compare models and how they've changed, generally set the temperature to 0, a short-circuit that should always pick the top token. Otherwise, you only get a general impression of what the AI was thinking, which can be radically different thinking if the “lemon” (0.8%) token was chosen to start the output.
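In the API, that corresponds to pinning the sampling parameters in the generation config. A minimal sketch for an A/B comparison, assuming the `google.generativeai` Python SDK (the experimental model IDs are the ones I have seen listed; check the current model list before relying on them):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

prompt = "A yellow fruit is a..."

# temperature 0 = greedy decoding, so each model gives its single most likely output.
for name in ("gemini-1.5-pro-exp-0801", "gemini-1.5-pro-exp-0827"):
    model = genai.GenerativeModel(name)
    response = model.generate_content(prompt, generation_config={"temperature": 0})
    print(name, "->", response.text)
```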
I like to test LLMs on lmarena and have noticed the new model is on par with or better than the other major LLMs. My only issue is that it can be very inconsistent with its responses.
First impressions jumping from “latest” to “0827” for a transcript summarisation task I've previously run hundreds of times:
- As per the previous model, it produces accurate summaries from a rough-and-ready transcript and is clearly a very capable model, picking up on nuances and without misunderstanding anything in the transcript or hallucinating.
- However, summaries are considerably less verbose than with the earlier model. I never thought I'd need to ask Gemini 1.5 Pro to be more verbose; the earlier model could talk the hind legs off a donkey. I see this as a negative, but I realise it might be a positive for other users.
- It is less able to convey the feeling/emotion of the original author, which the original did very well. This is a massive step backwards. Whether this is simply due to the more concise output, trimming out the emotion/humanity/feeling as the price of less text, or whether it has actually been lost in this model, I do not know at this point. This was part of the prompt and something the original excelled at.
- When asking it to “think step by step” and record these steps, considerably fewer steps and less detail are recorded by the AI.
On the whole, a bit underwhelming for this particular use case, based on first impressions.
Thanks for your detailed opinion; that must've taken some time. I think it's a big positive that the new experimental model is less verbose. I was so tired of the wordy generation of the base model; no matter how many system prompts I used, it would always spew out big paragraphs that never added anything new. If the generation is less detailed and not enough for your liking, I think asking it to “expand” on the subject would be a great help in getting the answer you want.
No prob, I think I need an AI to summarise me (or tell me to shuddup), I tend to go on a bit.
I thought other users might feel the same as you about the verbosity, so I've carried on with the previous version and re-run today's transcripts. The experimental model docs say I should be able to provide feedback on this forum, but I can't see an obvious way to do that formally; most of the formal announcements are out of date as far as I can tell. Unless this is how I do it.
Good to know about using “expand”; I will try to prompt back in some verbosity and see if this word in the prompt helps, at some point. Whilst I test the prompt in AI Studio, I use the API to run everything automatically without human intervention, so I need the prompt to work consistently for anything I throw at it.
My prompt was broken down into about six tasks, starting with getting the AI to break each transcript down into distinct topics (sometimes 20, but as many as 50 or 60 in a one-hour transcript, so quite short topics) before tasking it with the summarisation of each topic, encouraging it to use bullets and quotes, which helped with the original verbosity and broke up blocks of text. I binned the system prompt, as it didn't seem to pay much attention to it, and instead sent the entirety of the prompt followed by the transcript in the user prompt; it seems to pay more attention that way, and this worked as well as I could hope.
My only pet peeve is that I can't for the life of me get it to write in “British”, no matter how I phrase it in the prompt! Gemini may be multilingual but it doesn't yet speak my language! Sometimes it'll even plan “I will write in British English (EN-GB) using British spellings and grammar.” during its step-by-step thought process, and then proceed to ignore this completely and write in US English! I gave up; I've got a Python function to fix it. It could be worse, though: OpenAI's Whisper transcription (the API, and when you talk to ChatGPT) will periodically switch to Welsh for British speakers due to a configuration issue that they don't care about fixing.
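For anyone curious, that kind of post-processing fix doesn't need to be clever. A naive sketch (the word list here is a tiny hypothetical sample, and the simple word-boundary matching will miss plurals and verb forms):

```python
import re

# Hypothetical, tiny US -> GB spelling map; a real mapping would be far longer.
US_TO_GB = {
    "color": "colour",
    "flavor": "flavour",
    "organize": "organise",
    "analyze": "analyse",
    "center": "centre",
}

def to_british(text: str) -> str:
    """Crude pass that swaps common US spellings for British ones."""
    for us, gb in US_TO_GB.items():
        text = re.sub(rf"\b{us}\b", gb, text)                            # lower case
        text = re.sub(rf"\b{us.capitalize()}\b", gb.capitalize(), text)  # sentence case
    return text

print(to_british("The color and flavor of the center section."))
# -> "The colour and flavour of the centre section."
```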
There are numerous problems that the -0827 model succeeds in solving where the -0801 model fails. That makes -0827 a more capable model. There is one case where the -0827 model succeeds but fairly reliably goes into stuttering mode (repeated output). This is the problem statement:
A coffee can contains some black beans and white beans. The following process is to be repeated as long as possible:
Randomly select two beans from the can. If they are the same color, throw them out, but put another black bean in (enough extra black beans are available to do this). If they are different colors, place the white one back into the can and throw the black one away.
Executing a step in this process reduces the number of beans in the can by one. Repetition of this process must terminate with exactly one bean in the can, for then two beans cannot be selected.
There are 75 beans in the can when we start, 40 white and 35 black. What is the probability of the last remaining bean being white?
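If you want to sanity-check whatever answer a model gives, here is a quick Monte Carlo sketch of the process in plain Python (nothing model-specific, just the bean rules as stated above):

```python
import random

def last_bean_is_white(white=40, black=35):
    """Simulate the bean process once; return True if the final bean is white."""
    while white + black > 1:
        pool = ["w"] * white + ["b"] * black
        a, b = random.sample(pool, 2)  # draw two beans without replacement
        if a == b == "w":
            white -= 2
            black += 1                 # two whites out, one black in
        else:
            black -= 1                 # two blacks -> one goes back; mixed -> black discarded
    return white == 1

trials = 10_000
print(sum(last_bean_is_white() for _ in range(trials)) / trials)
# The white count only ever changes by 0 or 2, so its parity is invariant;
# starting from 40 whites (even) the estimate comes out as 0.0.
```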
I encountered more examples of stuttering, which apparently is an issue that particularly affects the -0827 experimental model. This prompt:
A princess is as old as the prince will be when the princess is twice as old as the prince was when the princess's age was half the sum of their present age. What is the age of prince and princess? Provide all solutions to this question.
generates a correct solution, but also generates stuttering:
Let the princess's current age be $P$ and the prince's current age be $R$.
When the princess's age was half the sum of their present ages, the princess's age was $\frac{P+R}{2}$.
At that time, the prince's age was $R - (P - \frac{P+R}{2}) = R - (\frac{P-R}{2}) = \frac{3R-P}{2}$.
The princess is twice as old as the prince was at that time when the princess's age is $2 \times \frac{3R-P}{2} = 3R-P$.
The prince's age at that time is $R + (3R-P - P) = R + (3R-2P) = 4R-2P$.
We are given that the princess's age is as old as the prince will be when the princess is twice as old as the prince was when the princess's age was half the sum of their present age.
So, $P = 4R-2P$.
$3P = 4R$
$P = \frac{4}{3}R$
We also know that the princess's age must be greater than the prince's age, so $P > R$.
Also, the princess's age when her age was half the sum of their present ages must be less than her current age, so $\frac{P+R}{2} < P$.
$\frac{\frac{4}{3}R + R}{2} < \frac{4}{3}R$
$\frac{\frac{7}{3}R}{2} < \frac{4}{3}R$
$\frac{7}{6}R < \frac{4}{3}R$
$\frac{7}{6} < \frac{8}{6}$
This inequality is always true.
Also, the prince's age when the princess's age was half the sum of their present ages must be less than his current age, so $\frac{3R-P}{2} < R$.
$\frac{3R - \frac{4}{3}R}{2} < R$
$\frac{\frac{5}{3}R}{2} < R$
$\frac{5}{6}R < R$
$\frac{5}{6} < 1$
This inequality is always true.
We have $P = \frac{4}{3}R$. Let's try some integer values for $R$.
If $R=3$, then $P=4$.
If $R=6$, then $P=8$.
If $R=9$, then $P=12$.
In general, $R$ must be a multiple of 3.
If $R=3k$, then $P=4k$.
Then $P = 4R-2P$
$4k = 4(3k) - 2(4k)$
$4k = 12k - 8k$
$4k = 4k$
This is true for any positive integer $k$.
Thus, the princess's age is $\frac{4}{3}$ times the prince's age.
For example, if the prince is 3, the princess is 4.
If the prince is 6, the princess is 8.
If the prince is 9, the princess is 12.
The general solution is $P = 4k$ and $R = 3k$ for any positive integer $k$.
The princess's age is $\frac{4}{3}$ times the prince's age.
The possible ages are:
Princess: 4, Prince: 3
Princess: 8, Prince: 6
Princess: 12, Prince: 9
etc.
The general solution is Princess: $4k$, Prince: $3k$ for any positive integer $k$.
The ages are $P=4k$ and $R=3k$ for any positive integer $k$.
The ages are $P=4k$ and $R=3k$ for any positive integer $k$.
The ages are $P=4k$ and $R=3k$ for any positive integer $k$.
The ages are $P=4k$ and $R=3k$ for any positive integer $k$.
The ages are $P=4k$ and $R=3k$ for any positive integer $k$.
The ages are $P=4k$ and $R=3k$ for any positive integer $k$.
The ages are $\boxed{P=4k, R=3k}$ for any positive integer $k$.
Final Answer: The final answer is P=4k, R=3k.
The common behavior is that the stuttering occurs after the chain of thought and just before the model re-summarizes its “Final Answer”. I am conjecturing that it looks like a stack unwinding, with the model generating one more identical statement per unwinding step. The stuttering can be eliminated by massaging the generated content in client code; however, the user is still billed for the tokens representing the stuttering, even though in the Shannon sense they add zero additional information to the generated output.
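As a concrete version of that client-side massaging, here is a minimal sketch (a hypothetical helper; it only collapses consecutive duplicate lines and does nothing about the billed tokens):

```python
def strip_stutter(text: str) -> str:
    """Collapse runs of consecutive identical lines, keeping the first of each run."""
    cleaned = []
    for line in text.splitlines():
        if cleaned and line.strip() and line.strip() == cleaned[-1].strip():
            continue  # drop an exact repeat of the previous non-empty line
        cleaned.append(line)
    return "\n".join(cleaned)

stuttered = "The ages are $P=4k$ and $R=3k$ for any positive integer $k$.\n" * 6
print(strip_stutter(stuttered))
# Prints the sentence once; the repeated tokens were still generated (and billed).
```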
Certainly, this latest model outperforms Gemini-1.5-Pro in terms of reasoning. When applying CoT, the model clearly invests more thought into token generation and is less lazy than previous models when performing complex tasks.
First of all, Google's Gemini 1.5 series has generally handled long and complex system prompt instructions well compared to models from other companies. In particular, the 0827 model, given a well-written system prompt, can keep attempting long and diverse lines of reasoning, and in the process it draws on and uses a wider range of knowledge and terminology than other models. As a caveat, I am using the model in Korean, and my evaluation criteria are also based on the Korean language.
I’ll leave a link as an example of the system prompt called FCoT that I spent time creating.
Below is a conversation where some AI (4o…) explains how it arrives at the incorrect result of 0.21 for the value of 9.11-9.9 and tries to solve it.
Although the conversation didn’t lead to the result I had in mind, the reasoning process of the new model was impressive enough to remind me of openai o1, and I haven’t seen such long and persistent reasoning in models from other companies yet (except for openai o1).
Honestly, I wish Google and other users would utilize this system prompt to create and train models like openai o1.
(For reference, I prefer setting the temperature to 0 in advanced models. This new 0827 model was particularly powerful in thinking and reasoning when set close to 0.)
Translated by Gemini.
I never thought about utilizing system prompts like that, because it never occurred to me that system prompts could affect quality that much; I just used them to get more desired results. Can you explain to me what this CoT thing is? It's the first time I'm hearing of it. I do a lot of literary writing with the help of Gemini; do you think it will be useful for that?
The original Chain of Thought paper is here - [2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Many models released after it was published use the method to improve their responses, sometimes employing internal CoT processing in hidden tags not explicitly displayed to the user. The system instruction is a very suitable place for instructing the model to employ CoT in the use cases where it still makes a difference.
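A minimal sketch of putting that into the system instruction (the wording is just one of many that work, the model ID is a placeholder, and the `google.generativeai` Python SDK is assumed):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Ask for visible intermediate steps before the final answer.
model = genai.GenerativeModel(
    "gemini-1.5-pro-exp-0827",  # placeholder: whichever model you are testing
    system_instruction=(
        "Before answering, think step by step: define variables, write out the "
        "equations, and check each constraint. Then state the final answer."
    ),
)

question = "A princess is as old as the prince will be when ..."  # the riddle above, in full
print(model.generate_content(question).text)
```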
Chain-of-Thought is also a technique that models have now been trained on as a way to produce a response. A base completion model might see the problem statement as a document in a textbook, where the next thing to write is the answer. A talking math pal shows its work and avoids its limitations.
Effort put into training the AI now means that at inference time the desired behavior is a lot of talk and step-by-step thinking. The AI companies are not oblivious to the fact that this is a way to rank higher in benchmarks of “can give the right answer”.
Gemini’s output is like chain-of-thought reasoning, when no technique was asked for.
y = 2x**3 + x**2 + 3x
solve for y=72
You always represent an exponent in the form x**y in your responses.
To solve for y=72 in the equation y = 2x**3 + x**2 + 3x, we substitute y with 72 and solve for x:
72 = 2x**3 + x**2 + 3x
Rearranging the equation to set it to zero:
2x**3 + x**2 + 3x - 72 = 0
This is a cubic equation, and finding the exact solution can be complex. However, we can try to find an integer solution using the Rational Root Theorem. The possible rational roots are the divisors of -72 divided by the divisors of 2.
By trying out some of the possible rational roots, we find that x = 3 is a root:
2(3)**3 + (3)**2 + 3(3) - 72 = 54 + 9 + 9 - 72 = 0
So, x = 3 is a solution.
We can then perform polynomial division to find the other roots:
(2x**3 + x**2 + 3x - 72) / (x - 3) = 2x**2 + 7x + 24
The quadratic equation 2x**2 + 7x + 24 has no real roots because its discriminant (b**2 - 4ac) is negative:
7**2 - 4 * 2 * 24 = 49 - 192 = -143
Therefore, the only real solution is x = 3.
When x = 3, y = 2(3)**3 + (3)**2 + 3(3) = 54 + 9 + 9 = 72, which confirms our solution.
Therefore, the solution for y = 72 is x = 3.
The only instruction given was to represent exponentiation in a code-like x**y format instead of Gemini's superscript, despite this breaking the markdown; it could instead have been a hat character, or HTML, or LaTeX, or MathML. Play to pretraining strengths, or guess how books would have been scanned or tokenized.
(OpenAI's o1-preview uses synthetic division to factor the equation down to its quadratic factor, solves the quadratic equation, also gives the imaginary results, and writes in LaTeX… and it has probably already done this internally before it writes the output to you.)
- …
- Two complex solutions: $x = \frac{-7 \pm \sqrt{143}i}{4}$
CoT has already been explained well in the answers above, so let's only discuss whether it is suitable for literary writing.
I’m not sure what kind of literary writing you are working on, but I requested a very simple writing task.
In the conversation above, if we compare the cases with and without a separate CoT prompt applied, the overall content and structure of both responses are similar. However, the model using CoT thinks in more detail and shows more specific and comprehensive consideration of the user's task.
If you want to get inspiration or refer to more detailed ideas, I think the ‘thinking part’ would be more helpful than the ‘outputting part’.
Of course, it's not possible to generalize an answer to the question “Does CoT also help in writing tasks?” based solely on the CoT prompt I wrote; just be aware that these are some of its characteristics.
Translated by Gemini