Case Study 1

All code for the case studies is here: https://github.com/Python-Test-Engineer/llm-evaluation-framework

Whilst this case study uses LangGraph, the specific framework is not important.

The patterns of routing, recursion and decision making present in this app are generic patterns, and this app could itself be a sub-app within another app.

The principles and techniques used to evaluate AI Agents and LLMs are the core matter here.

Business ROI

This business is about writing articles/blog posts in many different languages.

It must be:

  • About AI
  • In the chosen language
  • Around a certain specified length
  • Not be sensational but very professional and factual

The business depends on its performance and reputation.

We must ensure these outcomes are met as failure to do so will jeopardise the business and its profits.

Effectively, the ROI must be protected as a minimum; beyond that, the better the articles meet these criteria in accordance with the business model, the better the business and its ROI.

It is more than just working correctly.

Our evaluations must reflect this.
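As a minimal sketch, these business criteria could be captured in a small config object that the evaluations read from. The names and values below (topic, language, min_words) are illustrative assumptions, not the repo's actual config:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticleCriteria:
    """Business rules the evaluations check against (illustrative names)."""

    topic: str = "AI"                   # articles must be about AI
    language: str = "French"            # the configured target language
    min_words: int = 200                # CONTENT_LENGTH in the app config (value assumed)
    must_not_be_sensational: bool = True  # i.e. IS_NOT_SENSATIONAL


CRITERIA = ArticleCriteria()
```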

Article Writer App

[Figure: workflow diagram of the Article Writer App]

Given a title for an article, the app determines whether the article title is about AI.

If not, it goes to the END.

If it is, it gets passed to an editor that does three things:

  1. Translates to a configured language, French in this case.
  2. Expands the article title into content of more than N words, with N specified in config.
  3. Makes a subjective evaluation that the article IS_NOT_SENSATIONAL.

If all three are YES, it sets the postability flag to YES and moves to the publisher node, where the publish actions are taken.

If postability is still NO after MAX_RETRIES, the flow still moves to the publisher, but the NO value means the article is not published.

In the app, the publisher node holds very little code; there is no agentic action in this node and it is deterministic.
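A minimal LangGraph-style skeleton of this routing pattern is sketched below. The node names follow the description above (should_write_article, editor, publisher), but the state fields, helper functions and MAX_RETRIES value are illustrative assumptions, not the repo's actual code:

```python
from typing import TypedDict

from langgraph.graph import END, StateGraph


class ArticleState(TypedDict, total=False):
    title: str
    is_about_ai: str   # "yes" / "no" from the router
    article: str
    postability: str   # "yes" / "no"
    retries: int


MAX_RETRIES = 3  # assumed value


def should_write_article(state: ArticleState) -> ArticleState:
    # LLM call (omitted) decides whether the title is about AI.
    return state


def editor(state: ArticleState) -> ArticleState:
    # LLM call (omitted): translate, expand to N+ words, check sensationalism,
    # then set postability and increment retries.
    return state


def publisher(state: ArticleState) -> ArticleState:
    # Deterministic, non-agentic: publish only when postability == "yes".
    return state


def route_after_editor(state: ArticleState) -> str:
    # Stop retrying once postability is "yes" or the retry budget is spent.
    if state.get("postability") == "yes" or state.get("retries", 0) >= MAX_RETRIES:
        return "publisher"
    return "editor"  # retry


builder = StateGraph(ArticleState)
builder.add_node("should_write_article", should_write_article)
builder.add_node("editor", editor)
builder.add_node("publisher", publisher)

builder.set_entry_point("should_write_article")
builder.add_conditional_edges(
    "should_write_article",
    lambda s: "editor" if s.get("is_about_ai") == "yes" else END,
)
builder.add_conditional_edges("editor", route_after_editor)
builder.add_edge("publisher", END)

app = builder.compile()
```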

This is a leaf-level sub-graph, meaning it sits at the end of all the agentic actions.

This could be a sub-graph that is part of a larger graph. We would then test each sub-graph separately.

The first node, should_write_article, determines whether the short article headline is about the selected content topic, in our case AI/Technology.

[Figure: evaluator flow]

If so, it goes to the editor that ensures:

  • It is translated into the chosen language, French in this case.
  • Has sufficient length.
  • Is not sensational.

If the above three are 'yes', then the postability flag is set to 'yes' and it goes to the publisher for publishing.

We have a maximum number of iterations (MAX_RETRIES), and if postability is not 'yes' after that many iterations we still move to the publisher, but the code will detect 'no' and move to __END__ without taking the 'yes' actions.
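A minimal sketch of this piece of app logic, which EVALS04 later checks; the flag names are illustrative and may differ slightly from the repo:

```python
def set_postability(is_in_correct_language: bool,
                    meets_word_count: bool,
                    is_not_sensational: bool) -> str:
    """Postability is 'yes' only when all three editor checks pass."""
    checks = [is_in_correct_language, meets_word_count, is_not_sensational]
    return "yes" if all(checks) else "no"
```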

[Figure: CSV evaluation log output]

This case study shows how we can evaluate ROUTING and also parallel tool calls - we can think of a node as a tool call, since everything is just a function.

We can log all the output into the CSV file so that we can see more detail about each LLM call, such as tokens, or filter before dumping to the log file.

We can log input_tokens and output_tokens so that we can evaluate cost.
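As a rough sketch, cost per call can be derived from the logged token counts and per-token prices and appended to the same CSV. The prices and column names below are illustrative assumptions, not values from the repo:

```python
import csv

# Illustrative prices per million tokens; real prices depend on the chosen model.
PRICE_PER_M_INPUT = 0.15
PRICE_PER_M_OUTPUT = 0.60


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost of one LLM call from its token counts."""
    return (input_tokens * PRICE_PER_M_INPUT + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000


def log_call(path: str, row: dict) -> None:
    """Append one LLM call record (node, tokens, latency, cost) to the eval CSV."""
    row["cost_usd"] = round(call_cost(row["input_tokens"], row["output_tokens"]), 6)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if f.tell() == 0:  # write the header only for a new file
            writer.writeheader()
        writer.writerow(row)
```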

In EVALS01 and EVALS04, which use structured output, we use the tiktoken library to get token counts. In the other EVALS, where we are not using structured output, the metadata containing this information is more readily available.
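A minimal tiktoken sketch for counting tokens when the provider metadata is not available; the encoding name is an assumption and should match the model you use:

```python
import tiktoken

# cl100k_base is used by many OpenAI chat models; adjust for your model.
encoding = tiktoken.get_encoding("cl100k_base")


def count_tokens(text: str) -> int:
    """Approximate token count of a prompt or completion."""
    return len(encoding.encode(text))


print(count_tokens("Quels sont les derniers développements de l'IA ?"))
```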

Client Requirements

We are charged with testing and monitoring this agent.

What might we need to determine?

  • Does the app as a whole produce the right final output? Language, length, not sensational?
  • If postability is yes, are the other flags all yes? This is app logic.
  • What is the latency (time) for each LLM call? (See the timing sketch after this list.)
  • Can we judge the quality of the article using an LLM Judge and how is this LLM Judge created?
  • What impact does the choice of model have on cost, latency and performance? Can a compromise on quality yield significant cost savings without increasing latency significantly?
  • Does each of the nodes work correctly?

This can provide both technical and business effectiveness evaluation - in production, in real time and in front of the client.
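For the latency question above, a simple sketch is to time each node or LLM call and write the duration into the same trace CSV. The decorator below is illustrative, not the repo's implementation:

```python
import time
from functools import wraps


def timed(fn):
    """Wrap a node function and report its wall-clock latency."""

    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{fn.__name__} took {elapsed_ms:.1f} ms")  # or append to the CSV row
        return result

    return wrapper
```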

Dataset

I generated 30 sets of inputs and outputs and added 'domain expert ground truths' to them.

I then ran the inputs through the app, got the evaluation trace CSVs and then evaluated our app.

There are 4 evaluations, one for each of the four output files.
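A ground-truth row might look something like the sketch below; the column names and values are illustrative, and the real ones are in the repo's CSVs:

```python
ground_truth_row = {
    "title": "New breakthrough in AI reasoning models",
    "expected_is_about_ai": "yes",        # domain expert label
    "expected_language": "fr",
    "expected_min_words": 200,
    "expected_is_not_sensational": "yes",
}
```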

Evals

EVALS01

Test: Ensure only AI articles are processed by having 'yes' in the output of the router.
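A sketch of this check over the trace CSV, assuming a router output column and a ground-truth label column (the column names are illustrative):

```python
import csv


def routing_accuracy(trace_csv: str) -> float:
    """Fraction of rows where the router's yes/no matches the ground truth."""
    with open(trace_csv, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    correct = sum(
        r["router_output"].strip().lower() == r["expected_is_about_ai"].strip().lower()
        for r in rows
    )
    return correct / len(rows)
```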

EVALS02

Test: Ensure the title is translated into the correct language - French.
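One way to sketch this check is with a language-detection library such as langdetect; this is an assumption, as the repo may use an LLM judge or another method:

```python
from langdetect import DetectorFactory, detect  # pip install langdetect

DetectorFactory.seed = 0  # make detection deterministic


def is_french(text: str) -> bool:
    """True if the detected language code is French."""
    return detect(text) == "fr"


print(is_french("L'intelligence artificielle transforme le journalisme."))
```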

EVALS03

Test: Ensure the article has the correct number of words - CONTENT_LENGTH - and we can also ensure the language is still correct.
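A minimal word-count check using a simple whitespace split; the CONTENT_LENGTH value is assumed and the repo may count words differently:

```python
CONTENT_LENGTH = 200  # assumed value from config


def meets_word_count(article: str, minimum: int = CONTENT_LENGTH) -> bool:
    """True if the article body contains at least the configured number of words."""
    return len(article.split()) >= minimum
```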

EVALS04

Test: Ensure that if can_be_posted is yes then the other flags are also yes - is_not_sensational, is_in_correct_language and meets_word_count.
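A sketch of this logical consistency check over the trace CSV; the column names come from the flags above, but the file layout is an assumption:

```python
import csv


def postability_consistent(trace_csv: str) -> bool:
    """If can_be_posted is 'yes', every supporting flag must also be 'yes'."""
    with open(trace_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["can_be_posted"] == "yes":
                flags = (
                    row["is_not_sensational"],
                    row["is_in_correct_language"],
                    row["meets_word_count"],
                )
                if any(flag != "yes" for flag in flags):
                    return False
    return True
```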