How do you actually evaluate LLMs?

Hi, I’m curious how people here actually choose models in practice.

We’re a small research team at the University of Michigan studying real-world LLM evaluation workflows for our capstone project.

We’re trying to understand what actually happens when you:

•	Decide which model to ship

•	Balance cost, latency, output quality, and memory

•	Deal with benchmarks that don’t match production

•	Handle conflicting signals (metrics vs gut feeling)

•	Figure out what ultimately drives the final decision

If you’ve compared multiple LLMs in a real project (product, development, research, or serious build), we’d really value your input.

Short, anonymous survey (~5–8 minutes):

https://forms.gle/xEGuDCZ3UBmitGCz6

Mostly, I find that engaging in conversation during some task or project works well (for me).
My theory is that I’m gathering data to intuit factors like:

  • rigidity vs creativity
  • consistency with the task (rough sketch of quantifying this after the list)
  • flexibility in philosophy
    Essentially, I’m looking for a chaos quotient suitable for emergence while maintaining attentive contextual coherence and subjective alignment.
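
Here’s a rough sketch of how you could put a number on the consistency part, assuming you’ve already collected a handful of responses to the same prompt (the sample strings below are made up; difflib is crude, but it’s stdlib-only and catches a model that improvises a new answer every run):

    # Crude "consistency score": mean pairwise similarity of N responses
    # to the same prompt. How you collect the responses (API, local model,
    # temperature setting) is up to you; difflib is a blunt instrument,
    # but it needs no dependencies.
    from difflib import SequenceMatcher
    from itertools import combinations

    def consistency_score(responses: list[str]) -> float:
        """Mean pairwise similarity in [0, 1]; higher = more consistent."""
        pairs = list(combinations(responses, 2))
        if not pairs:
            return 1.0  # a single response is trivially self-consistent
        return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

    # Made-up samples standing in for three runs of the same prompt:
    samples = [
        "Use a priority queue keyed on deadline.",
        "A deadline-keyed priority queue works here.",
        "Honestly, just re-sort the list every time; it's fine.",
    ]
    print(f"consistency: {consistency_score(samples):.2f}")

String similarity obviously misses semantic consistency (two differently worded answers can agree), so treat it as a first filter, not a verdict.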

… I just realized explaining this is harder than I thought…
Bottom line, some random red flags:

  • focuses too much on immediate details, as opposed to integrating holistic abstraction.
  • grey hallucinations: remembering the right thing, but with the wrong details.
  • inability to comprehend analogy or metaphor (big one; see the probe sketch below).
    Being able to interpret concepts in a form other than precise, explicit definitions or directives is important.
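
That last one you can probe directly. A rough sketch; the probes, keyword lists, and canned responses are all invented for illustration, and keyword matching is a blunt stand-in for actually reading the answers:

    # Minimal metaphor-comprehension probe. Each probe has keywords a
    # figurative reading should hit, plus "literal tells" that flag a
    # model taking the metaphor at face value.
    PROBES = [
        {
            "prompt": "Explain what 'the codebase is a house of cards' implies, without using the word 'cards'.",
            "expect_any": ["fragile", "collapse", "unstable", "brittle"],
            "literal_tells": ["playing cards", "card game"],
        },
        {
            "prompt": "A colleague says our roadmap is 'boiling the ocean'. What are they getting at?",
            "expect_any": ["too ambitious", "too broad", "unrealistic", "scope"],
            "literal_tells": ["water", "temperature", "evaporat"],
        },
    ]

    def score_response(probe: dict, response: str) -> str:
        text = response.lower()
        if any(tell in text for tell in probe["literal_tells"]):
            return "RED FLAG: literal reading"
        if any(kw in text for kw in probe["expect_any"]):
            return "ok: got the figurative meaning"
        return "unclear: grade by hand"

    # Canned responses standing in for real model output:
    print(score_response(PROBES[0], "It means the architecture is brittle and one change could make it collapse."))
    print(score_response(PROBES[1], "They think the plan's scope is too broad to ever ship."))

The "unclear" bucket is the interesting one: grade those by hand before concluding anything.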

I guess I’ll stop there.