LLM as judge¶
A judge¶
The technique of having an (untested) LLM generate question/answer pairs to act as ground truths for testing another LLM seems illogical, but current thinking holds that it is quite effective.
This technique lets developers improve the knowledge system as they work on it, rather than getting human evaluations at every step.
Smaller, fine-tuned models may be more effective than larger models, and if we use open-source models we need only pay for compute.
Arxiv Paper: https://arxiv.org/pdf/2412.05579
Definitive guide to building LLM Judges
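As a rough sketch of this idea, the snippet below asks a model to generate question/answer pairs from a chunk of source text. It assumes the OpenAI Python client; the model name, prompt wording, and JSON shape are illustrative assumptions, not from the paper.

```python
# Sketch: having an LLM generate question/answer pairs from a document chunk
# to serve as ground truths. Model name, prompt wording, and JSON shape are
# illustrative assumptions.
import json

from openai import OpenAI

client = OpenAI()

def generate_qa_pairs(chunk: str, n: int = 3) -> list[dict]:
    """Ask the model for n question/answer pairs answerable from `chunk` alone."""
    prompt = (
        f"Read the text below and write {n} question/answer pairs that can be "
        "answered from the text alone. Respond with a JSON object of the form "
        '{"pairs": [{"question": "...", "answer": "..."}]}.\n\n'
        f"Text:\n{chunk}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model could be used
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["pairs"]
```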
What has been found most effective is NOT to have an LLM Judge come up with its own answer and then check whether it is 'equal' to the agent's answer. There is something logically ungrounded in that approach: the answer of an untested judge becomes the standard the agent is tested against.
It is better to have the LLM Judge decide whether, given the query and the purpose of the agent, the output is 'good', and to give its reasoning.
What is 'good'? What is a 'good email', if we are asking our judge to evaluate emails produced by the agent?
This is where a number of examples of both 'good' and 'bad' emails are given in the prompt.
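A minimal sketch of such a judge prompt is below, again assuming the OpenAI Python client; the agent's purpose, the example emails, and the model name are all illustrative assumptions.

```python
# Sketch: a judge prompt seeded with labelled 'good' and 'bad' email examples.
# The agent purpose, example emails, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating emails written by a customer-support agent.
The agent's purpose is to resolve the customer's issue politely and concisely.

Example of a GOOD email:
- "Hi Sam, your refund was issued today and should arrive within 5 days. Let me know if it doesn't."

Example of a BAD email:
- "Dear valued customer, thank you for contacting us. Your query is important to us." (ignores the question)

Given the customer query and the agent's email below, reply with a verdict
('good' or 'bad') followed by one short paragraph of reasoning.

Customer query:
{query}

Agent email:
{email}
"""

def judge_email(query: str, email: str) -> str:
    """Return the judge's verdict and reasoning as free text."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: swap in whichever judge model you use
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, email=email)}],
    )
    return resp.choices[0].message.content
```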
Building effective judges is a skill in its own right.
Uses¶
Some additional example uses:
- Politeness: Is the response respectful and considerate?
- Bias: Does the response show prejudice towards a particular group?
- Tone: Is the tone formal, friendly, or conversational?
- Sentiment: Is the emotion expressed in the text positive, negative, or neutral?
- Hallucinations: Does this response stick to the provided context?
- Adversarial: We can test that it does NOT do certain things, as well as test edge cases.
By asking for not just the grade but its reasoning, we can get a more complete picture of how the judge is evaluating the LLM.
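Below is a hedged sketch of a criterion-based judge that returns both a grade and its reasoning, reusing the criteria wording from the list above; the model name and JSON shape are assumptions.

```python
# Sketch: a criterion-parameterised judge returning a grade plus its reasoning.
# Criteria wording is taken from the list above; model name and JSON shape are
# assumptions.
import json

from openai import OpenAI

client = OpenAI()

CRITERIA = {
    "politeness": "Is the response respectful and considerate?",
    "bias": "Does the response show prejudice towards a particular group?",
    "hallucination": "Does the response stick to the provided context?",
}

def judge(criterion: str, context: str, response: str) -> dict:
    """Return {'grade': 'pass' or 'fail', 'reasoning': str} for one criterion."""
    prompt = (
        f"Criterion: {CRITERIA[criterion]}\n\n"
        f"Context:\n{context}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        'Reply with a JSON object of the form {"grade": "pass" or "fail", "reasoning": "..."}.'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```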
Code demo with examples: https://www.youtube.com/watch?v=LZJTrAXcyFM
Testing the Judge¶
If we have a Golden Dataset, we can run the judge over it and see how its grades compare to those of our domain expert.
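A small sketch of that comparison, assuming the judge() helper from the earlier sketch and a golden dataset labelled by the expert with pass/fail grades:

```python
# Sketch: comparing judge verdicts to a domain expert's labels on a golden dataset.
# The dataset fields and the judge() helper are assumptions carried over from the
# sketch above; the agreement threshold mentioned below is illustrative.
def judge_agreement(golden_dataset: list[dict], criterion: str = "hallucination") -> float:
    """Each item: {'context': str, 'response': str, 'expert_grade': 'pass' or 'fail'}."""
    matches = 0
    for item in golden_dataset:
        verdict = judge(criterion, item["context"], item["response"])
        if verdict["grade"] == item["expert_grade"]:
            matches += 1
    return matches / len(golden_dataset)

# Example: if agreement drops below an agreed threshold (say 0.9), revisit the
# judge prompt or model before trusting it in place of human review.
```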
Periodic re-evaluation will be needed.