How do you actually evaluate LLMs?

Hi, I’m curious how people here actually choose models in practice.

We’re a small research team at the University of Michigan studying real-world LLM evaluation workflows for our capstone project.

We’re trying to understand what actually happens when you:

•	Decide which model to ship

•	Balance cost, latency, output quality, and memory

•	Deal with benchmarks that don’t match production

•	Handle conflicting signals (metrics vs gut feeling)

•	Figure out what ultimately drives the final decision

If you’ve compared multiple LLMs in a real project (product, development, research, or serious build), we’d really value your input.

Short, anonymous survey (~5–8 minutes):

https://forms.gle/xEGuDCZ3UBmitGCz6

Mostly, I find that engaging in conversation during some task or project works well (for me).
My theory is that I’m gathering data to intuit factors like:

  • rigidity vs creativity
  • consistency with the task (rough sketch of quantifying this after the list)
  • flexibility in philosophy
    Essentially, I’m looking for a chaos quotient suitable for emergence while maintaining attentive contextual coherence and subjective alignment.
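
Here’s a rough sketch of how you could put a number on the consistency part, assuming you’ve already collected a handful of responses to the same prompt (the sample strings below are made up; difflib is crude, but it’s stdlib-only and catches a model that improvises a new answer every run):

    # Crude "consistency score": mean pairwise similarity of N responses
    # to the same prompt. How you collect the responses (API, local model,
    # temperature setting) is up to you; difflib is a blunt instrument,
    # but it needs no dependencies.
    from difflib import SequenceMatcher
    from itertools import combinations

    def consistency_score(responses: list[str]) -> float:
        """Mean pairwise similarity in [0, 1]; higher = more consistent."""
        pairs = list(combinations(responses, 2))
        if not pairs:
            return 1.0  # a single response is trivially self-consistent
        return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

    # Made-up samples standing in for three runs of the same prompt:
    samples = [
        "Use a priority queue keyed on deadline.",
        "A deadline-keyed priority queue works here.",
        "Honestly, just re-sort the list every time; it's fine.",
    ]
    print(f"consistency: {consistency_score(samples):.2f}")

String similarity obviously misses semantic consistency (two differently worded answers can agree), so treat it as a first filter, not a verdict.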

… I just realized explaining this is harder than I thought…
Bottom line, some random red flags:

  • focuses too much on immediate details, as opposed to integrating holistic abstraction.
  • grey hallucinations: remembering the right thing, but with the wrong details.
  • inability to comprehend analogy or metaphor (big one; see the probe sketch below).
    Being able to interpret concepts in a form other than precise, explicit definitions or directives is important.
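
That last one you can probe directly. A rough sketch; the probes, keyword lists, and canned responses are all invented for illustration, and keyword matching is a blunt stand-in for actually reading the answers:

    # Minimal metaphor-comprehension probe. Each probe has keywords a
    # figurative reading should hit, plus "literal tells" that flag a
    # model taking the metaphor at face value.
    PROBES = [
        {
            "prompt": "Explain what 'the codebase is a house of cards' implies, without using the word 'cards'.",
            "expect_any": ["fragile", "collapse", "unstable", "brittle"],
            "literal_tells": ["playing cards", "card game"],
        },
        {
            "prompt": "A colleague says our roadmap is 'boiling the ocean'. What are they getting at?",
            "expect_any": ["too ambitious", "too broad", "unrealistic", "scope"],
            "literal_tells": ["water", "temperature", "evaporat"],
        },
    ]

    def score_response(probe: dict, response: str) -> str:
        text = response.lower()
        if any(tell in text for tell in probe["literal_tells"]):
            return "RED FLAG: literal reading"
        if any(kw in text for kw in probe["expect_any"]):
            return "ok: got the figurative meaning"
        return "unclear: grade by hand"

    # Canned responses standing in for real model output:
    print(score_response(PROBES[0], "It means the architecture is brittle and one change could make it collapse."))
    print(score_response(PROBES[1], "They think the plan's scope is too broad to ever ship."))

The "unclear" bucket is the interesting one: grade those by hand before concluding anything.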

I guess I’ll stop there.