Case Study 1¶
All code for the case studies is here: https://github.com/Python-Test-Engineer/llm-evaluation-framework
Whilst this case study uses LangGraph, the framework itself is not of importance.
Patterns of routing, recursion and decision making that are present in this app are generic patterns, and this app could itself be a sub-app for another app.
The principles and techniques used to evaluate AI Agents and LLMs are the core matter here.
Business ROI¶
This business is about writing articles/blog posts in many different languages.
It must be:
- About AI
- In the chosen language
- Around a certain specified length
- Not be sensational but very professional and factual
The business depends on its performance and reputation.
We must ensure these outcomes are met as failure to do so will jeopardise the business and its profits.
Effectively, the ROI is to be protected at the very minimum; the better the articles meet these criteria in accordance with the business model, the better the business and its ROI.
It is more than just working correctly.
Our evaluations must reflect this.
Article Writer App¶
Given a title for an article, the app determines if the article title is about AI.
If not, it goes to the END.
If it is, it gets passed to an editor that does three things:
- Translates to a configured language, French in this case.
- Expands the article title into content of more than N words, with N specified in config.
- Makes a subjective evaluation that the article IS_NOT_SENSATIONAL.
If all three are YES, it then sets the postability flag to YES and moves to the `publisher` node, where certain actions are taken if `yes`.
If the postability is still `no` after MAX_RETRIES, it also moves to `publisher`, but the `no` value will cause the article not to be published.
In the app, the `publisher` node has no code; there is no agentic action in this node and it is deterministic.
This is a leaf-level sub-graph, meaning it is at the end of all the agentic actions.
This could be a sub-graph that is part of a larger graph. We would then test each sub-graph separately.
The first node, `should_write_article`, determines if the short article headline is about the selected content topic, in our case AI/Technology.
If so, it goes to the editor that ensures:
- It is translated into the chosen language, French in this case.
- Has sufficient length.
- Is not sensational.
If all three of the above are `yes`, the postability flag is set to `yes` and the article goes to the `publisher` for publishing.
We have a MAX number of iterations, and if the postability is still not `yes` after MAX iterations, we still move to `publisher`, but the code will detect `no` and move to `__END__` without taking the `yes` actions.
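To make the routing concrete, below is a minimal sketch of how such a graph could be wired in LangGraph. The state shape, node bodies and the MAX_RETRIES value are assumptions for illustration and are not taken from the repository; only the overall routing pattern matches the description above.

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END

MAX_RETRIES = 3  # assumed config value


class ArticleState(TypedDict):
    # Hypothetical state shape for illustration only.
    title: str
    content: str
    is_about_ai: str   # "yes" / "no"
    postability: str   # "yes" / "no"
    retries: int


def should_write_article(state: ArticleState) -> dict:
    # Placeholder: in the real app an LLM call decides this.
    return {"is_about_ai": "yes"}


def editor(state: ArticleState) -> dict:
    # Placeholder: in the real app LLM calls translate, expand to the
    # configured length, check it is not sensational and set postability.
    return {"postability": "yes", "retries": state["retries"] + 1}


def publisher(state: ArticleState) -> dict:
    # Deterministic, no agentic action: publish only if postability == "yes".
    return {}


def route_after_router(state: ArticleState) -> str:
    return "editor" if state["is_about_ai"] == "yes" else END


def route_after_editor(state: ArticleState) -> str:
    if state["postability"] == "yes" or state["retries"] >= MAX_RETRIES:
        return "publisher"
    return "editor"  # retry the edit loop


graph = StateGraph(ArticleState)
graph.add_node("should_write_article", should_write_article)
graph.add_node("editor", editor)
graph.add_node("publisher", publisher)
graph.add_edge(START, "should_write_article")
graph.add_conditional_edges("should_write_article", route_after_router)
graph.add_conditional_edges("editor", route_after_editor)
graph.add_edge("publisher", END)
app = graph.compile()
```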
This case study shows how we can evaluate ROUTING and also parallel tool calls - we can think of a node as a tool call, since everything is just a function.
We can log all the output into a CSV file so that we can see more detail about each LLM call, such as tokens, or filter before dumping to the log file.
We can log `input_tokens` and `output_tokens` so that we can evaluate cost.
In EVALS01 and EVALS04, which use structured output, we use the `tiktoken` library to get token counts. In the other EVALS, where we are not using structured output, the metadata containing this information is more readily available.
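As a rough sketch, the token counting and CSV logging might look like the following; the encoding name, file name and column layout are assumptions rather than the repository's actual choices.

```python
import csv
import time

import tiktoken

# Assumed encoding name; match it to the model actually used.
enc = tiktoken.get_encoding("cl100k_base")


def log_llm_call(csv_path: str, node: str, prompt: str, completion: str) -> None:
    """Append one row per LLM call so cost can be analysed later."""
    input_tokens = len(enc.encode(prompt))
    output_tokens = len(enc.encode(completion))
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([time.time(), node, input_tokens, output_tokens])


# Example usage (hypothetical variables):
# log_llm_call("evals_trace.csv", "editor", prompt_text, llm_response_text)
```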
Client Requirements¶
We are charged with testing and monitoring this agent.
What might we need to determine?
- Does the app as a whole produce the right final output? Language, length, not sensational?
- If the `postability` is `yes`, are the other flags all `yes`? (This is app logic.)
- What is the latency (time) for each LLM call?
- Can we judge the quality of the article using an LLM Judge and how is this LLM Judge created?
- What impact does the choice of model have on cost, latency and performance? Can a compromise on quality yield significant cost savings without increasing latency significantly?
- Does each of the nodes work correctly?
This can provide both technical and business effectiveness evaluation - in production, in real time and in front of the client.
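For the latency question in particular, one simple pattern is to time each LLM call and record the elapsed time alongside the trace row. A minimal sketch, with a hypothetical `llm.invoke` call:

```python
import time


def timed(fn, *args, **kwargs):
    """Run any LLM call or node function and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Example usage (hypothetical llm object):
# response, latency_s = timed(llm.invoke, prompt)
```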
Dataset¶
I generated 30 sets of inputs and outputs and added 'domain expert ground truths' to them.
I then ran the inputs through the app, got the evaluation trace CSVs and then evaluated our app.
There are 4 evaluations, one for each of the four output files.
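A sketch of how the ground-truth dataset and a trace CSV could be joined for evaluation; the file names and the `title` join column are assumptions for illustration:

```python
import pandas as pd

# Hypothetical file and column names for illustration.
ground_truth = pd.read_csv("dataset_ground_truth.csv")  # 30 rows with expert expectations
trace = pd.read_csv("evals_trace.csv")                   # rows produced by running the app

# Join on the article title so each run can be compared to the expert's expectation.
merged = trace.merge(ground_truth, on="title", suffixes=("_actual", "_expected"))
print(merged.head())
```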
Evals¶
EVALS01¶
Test: Ensure only AI articles are processed by having 'yes' in the output of the router.
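A minimal sketch of this check, reusing the hypothetical `merged` dataframe from the Dataset section; the `router_output_*` column names are assumptions:

```python
# EVALS01: the router should output "yes" exactly when the ground truth
# says the title is about AI.
matches = (
    merged["router_output_actual"].str.lower()
    == merged["router_output_expected"].str.lower()
)
print(f"Routing accuracy: {matches.mean():.0%}")
print(merged.loc[~matches, ["title", "router_output_actual", "router_output_expected"]])
```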
EVALS02¶
Test: Ensure the title is translated into the correct language - French.
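One lightweight, programmatic way to verify the language is a language-detection library such as `langdetect`; this is an assumed approach, not necessarily what the repository uses, and `translated_title` is a hypothetical column name:

```python
from langdetect import detect  # assumed helper library


def is_french(text: str) -> bool:
    try:
        return detect(text) == "fr"
    except Exception:
        return False


merged["language_ok"] = merged["translated_title"].apply(is_french)
print(f"Correct language rate: {merged['language_ok'].mean():.0%}")
```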
EVALS03¶
Test: Ensure the article has the correct number of words - CONTENT_LENGTH - and we can also ensure the language is still correct.
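A sketch of the length check, again with an assumed column name and an assumed CONTENT_LENGTH value:

```python
CONTENT_LENGTH = 100  # assumed minimum word count from config

# "article_content" is a hypothetical column name.
merged["word_count"] = merged["article_content"].str.split().str.len()
merged["length_ok"] = merged["word_count"] >= CONTENT_LENGTH
print(f"Meets word count: {merged['length_ok'].mean():.0%}")
# The EVALS02 language check can be re-run on the expanded content here as well.
```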
EVALS04¶
Test: Ensure that if `can_be_posted` is `yes`, then the other flags are also `yes`: `is_not_sensational`, `is_in_correct_langage` and `meets_word_count`.
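A sketch of this consistency check against the hypothetical `merged` dataframe; the flag names come from the app, but treating them as CSV columns here is an assumption:

```python
# EVALS04: when can_be_posted is "yes", every supporting flag must also be "yes".
flags = ["is_not_sensational", "is_in_correct_langage", "meets_word_count"]
posted = merged[merged["can_be_posted"].str.lower() == "yes"]
consistent = posted[flags].apply(lambda col: col.str.lower()).eq("yes").all(axis=1)
print(f"Flag consistency when posted: {consistent.mean():.0%}")
print(posted.loc[~consistent, ["title"] + flags])
```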