
Key Ideas

  1. LLM evals ≠ benchmarking.

  2. LLM evals are a tool, not a task.

  3. LLM evals ≠ software testing.

  4. Combine manual and automated evals.

  5. Use reference-based and reference-free evals.

  6. Think in datasets, not unit tests.

  7. LLM-as-a-judge is a key method. LLM judges scale rather than replace human evals.

  8. Use custom criteria, not generic metrics.

  9. Start with analytics.

  10. Evaluation is a moat.
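The "think in datasets, not unit tests" idea can be sketched as a loop over a whole dataset of cases with an aggregate score, rather than a single assert. This is a minimal illustrative sketch; the dataset, `run_eval`, and `answer_fn` are hypothetical names, and the check here is a simple reference-based substring match:

```python
# Hypothetical sketch: evaluate over a dataset of cases and report an
# aggregate pass rate, instead of asserting on one input/output pair.
dataset = [
    {"question": "2+2?", "expected": "4"},
    {"question": "Capital of France?", "expected": "Paris"},
]

def run_eval(answer_fn, cases) -> float:
    """Reference-based eval: fraction of answers containing the expected string."""
    hits = sum(case["expected"] in answer_fn(case["question"]) for case in cases)
    return hits / len(cases)
```

A reference-free eval would follow the same shape, but each check would inspect the answer alone (format, length, tone) instead of comparing it to an expected value.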

The reasons an LLM judge gives alongside each label help teams know what to fix...
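One way to capture both the label and the reason is to ask the judge for structured output and parse it defensively. This is a minimal sketch under assumptions: the prompt wording, the label set, and `parse_judge_output` are all illustrative, not a prescribed schema:

```python
# Hypothetical sketch: an LLM judge that returns a categorical label
# plus a short reason, so teams can see *what* to fix, not just a score.
import json

JUDGE_PROMPT = """You are evaluating a chatbot answer.
Classify it with exactly one label: "good", "incomplete", or "off-topic".
Return JSON: {{"label": "...", "reason": "..."}}

Question: {question}
Answer: {answer}
"""

VALID_LABELS = {"good", "incomplete", "off-topic"}

def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply; fall back to a review label on failure."""
    try:
        verdict = json.loads(raw)
        if verdict.get("label") in VALID_LABELS:
            return verdict
    except json.JSONDecodeError:
        pass
    return {"label": "needs-review", "reason": "unparseable judge output"}
```

The fallback label matters in practice: judge replies that fail validation should be routed to a human rather than silently dropped.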

Single most important point

Avoid numeric scores; use categorical labels instead. What does a 7 even mean? Research has shown that LLMs tend to cluster at the extremes, giving 1s and 10s rather than using the scale linearly.

LLM evals are part of a bigger testing programme.


Not all evals are LLM-based; traditional code-based evals are also useful.
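For instance, simple deterministic checks can enforce format or policy constraints without any model call. These helper names and the (simplified) email regex are illustrative:

```python
# Hypothetical sketch: code-based evals that need no LLM at all.
import json
import re

def contains_no_email(text: str) -> bool:
    """Deterministic policy check: flag answers that leak an email address."""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text) is None

def is_valid_json(text: str) -> bool:
    """Deterministic format check: does the answer parse as JSON?"""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False
```

Checks like these are cheap, fast, and perfectly reproducible, which makes them a good first layer before any LLM-as-a-judge step.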


A single component can have a matrix of evaluations: multiple criteria applied to every one of its outputs.
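Such a matrix might look like the following sketch, where each row is one output and each column is one criterion. The criteria here are simple code checks with made-up names; in practice some cells could come from an LLM judge:

```python
# Hypothetical sketch: a matrix of evals for one component.
# Rows = outputs, columns = criteria.
criteria = {
    "non_empty": lambda out: len(out.strip()) > 0,
    "short_enough": lambda out: len(out) <= 200,
    "no_refusal": lambda out: "i cannot" not in out.lower(),
}

def eval_matrix(outputs: list[str]) -> list[dict[str, bool]]:
    """Apply every criterion to every output."""
    return [
        {name: check(out) for name, check in criteria.items()}
        for out in outputs
    ]
```

Reading the matrix column-wise shows which criterion fails most often; reading it row-wise shows which outputs fail on multiple fronts at once.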
