Key Ideas
- LLM evals ≠ benchmarking.
- LLM evals are a tool, not a task.
- LLM evals ≠ software testing.
- Combine manual and automated evals.
- Use reference-based and reference-free evals (see the first sketch after this list).
- Think in datasets, not unit tests (see the second sketch after this list).
- LLM-as-a-judge is a key method. LLM judges scale human evals rather than replace them.
- Use custom criteria, not generic metrics.
- Start with analytics.
- Evaluation is a moat.
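To make the reference-based vs. reference-free distinction concrete, here is a minimal Python sketch; the function names and the email regex are illustrative, not from the original text. A reference-based eval compares the output against a known-good answer, while a reference-free eval checks a property of the output on its own.

```python
import re

# Reference-based: compare the output against a known-good answer.
def exact_match(output: str, reference: str) -> bool:
    return output.strip().lower() == reference.strip().lower()

# Reference-free: check a property of the output alone, no reference needed.
def leaks_email(output: str) -> bool:
    return re.search(r"[\w.+-]+@[\w-]+\.\w+", output) is not None

answer = "Paris"
print(exact_match(answer, reference="Paris"))  # True  (needs a reference)
print(leaks_email(answer))                     # False (needs no reference)
```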
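And a sketch of thinking in datasets: score every row of an eval set and report an aggregate pass rate, rather than asserting on single hand-picked examples. The dataset and the canned model answers below are made up for illustration.

```python
# Score every row of an eval dataset, then report an aggregate pass rate.
# This is a dataset-level metric, not a pass/fail assert on one example.
dataset = [
    {"question": "Capital of France?", "reference": "Paris"},
    {"question": "Capital of Japan?", "reference": "Tokyo"},
]

def run_model(question: str) -> str:
    # Placeholder for a real model call; canned answers for the sketch.
    canned = {"Capital of France?": "Paris", "Capital of Japan?": "Kyoto"}
    return canned[question]

scores = [run_model(row["question"]) == row["reference"] for row in dataset]
print(f"pass rate: {sum(scores) / len(scores):.0%}")  # 50%
```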
An LLM judge that explains why it gives a label helps teams know what to fix... Avoid numeric scores; use categorical labels instead: what does a 7 mean? Research has shown that LLM judges tend to give extreme scores like 1 or 10 rather than using the scale linearly.
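A minimal sketch of an LLM judge following these points, assuming the OpenAI Python SDK (the model name, criterion, and label set are illustrative): the judge returns a categorical label plus the reason behind it.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat LLM works

client = OpenAI()

JUDGE_PROMPT = """You are judging a customer-support answer.
Criterion: does the answer actually resolve the user's question?
Reply with exactly one label (GOOD, PARTIAL, or BAD) on the first line,
then one sentence explaining why."""

def judge(question: str, answer: str) -> tuple[str, str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    content = response.choices[0].message.content or ""
    label, _, reason = content.partition("\n")
    return label.strip(), reason.strip()

label, reason = judge("How do I reset my password?", "Try restarting your router.")
print(label, reason)  # e.g. "BAD The answer does not address password reset."
```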
LLM evals are part of a bigger testing programme. Not all evals are LLM-based; traditional code-based evals are also useful. A single component can carry a whole matrix of evaluations, one result per criterion (see the sketch below).
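A sketch of both points together: plain code-based checks, arranged as a matrix of criteria applied to one component's output. The criteria here are invented for illustration.

```python
import json
import re

def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

# Plain code-based checks, no LLM involved.
CHECKS = {
    "valid_json": is_valid_json,
    "under_200_chars": lambda out: len(out) <= 200,
    "no_email_leak": lambda out: re.search(r"[\w.+-]+@[\w-]+\.\w+", out) is None,
}

def evaluate(output: str) -> dict[str, bool]:
    # One output, several criteria: one row of the evaluation matrix.
    return {name: check(output) for name, check in CHECKS.items()}

print(evaluate('{"status": "ok"}'))
# {'valid_json': True, 'under_200_chars': True, 'no_email_leak': True}
```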