
Evaluating AI Agents

Aim

This is a manual for establishing a frictionless way of testing and evaluating agentic systems, for both the developer and QA.

Test:
- In production
- In real time
- In front of the client
while optimising performance, cost and latency - that is the ROI!

Set business success criteria and ROI targets - make them SMART.

Evaluate to measure the increase in success and the consequent ROI.

Use meaningful labels rather than numeric scores

Evals

A Unit Test, in agentic terms, is the smallest block of code that uses an LLM to determine the ROUTE and the RESPONSE.

It may contain other deterministic functionality, which we can test in the usual way, but this manual focuses on testing and monitoring agentic systems.
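
As a sketch of what such a unit test can look like (in pytest style): grade_article is a stand-in for a real unit that would call an LLM, and all names here are illustrative rather than taken from the case study.

```python
# Sketch only: grade_article stands in for a real agentic unit that calls an LLM.
def grade_article(draft: str) -> tuple[str, dict]:
    # The real unit would ask an LLM whether the draft is publishable;
    # here the decision is stubbed so the test shape is clear.
    publishable = len(draft.split()) > 300
    route = "publisher" if publishable else "writer"
    response = {"verdict": "publishable" if publishable else "too_short"}
    return route, response


def test_short_draft_routes_back_to_writer():
    route, response = grade_article("A two-sentence stub of an article.")
    assert route == "writer"                   # ROUTE: the correct next component
    assert response["verdict"] == "too_short"  # RESPONSE: a meaningful label, not a score
```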

The App

In Case Study 1, we work with a LangGraph app that is modelled on an article creation business.

Article Writer

In the case study, we use evals to generate ROI - or protect it - for a business that wants to use Agents. There are, of course, many 'deterministic' evals/tests, but they are not the focus of this case study.

Patterns

This example neatly uses a number of workflow patterns - router, orchestrator and worker.

Routing

One fundamental pattern is ROUTING - does the Agent select the correct tool/function/skill with the correct inputs?

src\article_writer_langgraph.py shows the routing pattern for system_grader and the log output is in article_writer.csv:

[image: article-writer-evaluator]

ArticlePostabilityGrader logs to article_writer_can_publish.csv (we can have just one log file):

[image: article-writer-evaluator]
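
As a rough illustration of the routing pattern in LangGraph terms - everything other than the system_grader name (the word-count stub, the other node names and the graph wiring) is a simplified placeholder, not the code in src\article_writer_langgraph.py:

```python
from typing import TypedDict

from langgraph.graph import END, StateGraph


class ArticleState(TypedDict):
    draft: str
    verdict: str  # label set by the grader, e.g. "publish" or "rewrite"


def system_grader(state: ArticleState) -> dict:
    # The real node asks an LLM to grade the draft and logs the event to
    # article_writer.csv; the decision is stubbed here for illustration.
    verdict = "publish" if len(state["draft"].split()) > 300 else "rewrite"
    return {"verdict": verdict}


def writer(state: ArticleState) -> dict:
    return {"draft": state["draft"]}  # placeholder worker: would revise the draft


def publisher(state: ArticleState) -> dict:
    return {}  # placeholder worker: would publish the article


def route(state: ArticleState) -> str:
    # ROUTING: the grader's label decides which node runs next.
    return state["verdict"]


graph = StateGraph(ArticleState)
graph.add_node("system_grader", system_grader)
graph.add_node("writer", writer)
graph.add_node("publisher", publisher)
graph.set_entry_point("system_grader")
graph.add_conditional_edges("system_grader", route, {"rewrite": "writer", "publish": "publisher"})
graph.add_edge("writer", END)
graph.add_edge("publisher", END)
app = graph.compile()
```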

In general: [images: Agentic Evaluation]

There is also NEXT - does the Agent select the correct next step, where applicable?

[image: Agentic Evaluation]
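
To evaluate ROUTE and NEXT, we can read the log (or a dataset joined with expected values) back and count the Agent's choices against the expected ones. A minimal sketch, assuming pipe-delimited rows with tool_called and expected columns - the column names follow the dataset described in the next section and are not part of the core trace format:

```python
import csv
from collections import Counter


def routing_report(log_path: str) -> Counter:
    """Count (expected, tool_called) pairs - effectively a confusion matrix over routes."""
    pairs = Counter()
    with open(log_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="|"):
            pairs[(row["expected"], row["tool_called"])] += 1
    return pairs


# Usage: routing_report("article_writer.csv") returns counts keyed by
# (expected, actual) route pairs; off-diagonal keys are the misroutes.
```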

Output

For a given input, we will obtain an output.

We may retrieve additional context to support the generation of the output.

We will also have REFERENCES - ground truths.

Our goal is to get a number of datasets:

INPUT - OUTPUT - CONTEXT - REFERENCE

TOOL_CALLED - ARGUMENTS - NEXT - EXPECTED

Once we have these, we can use many existing libraries or write our own custom evaluations.
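
As a sketch, here are the two row types as plain Python structures, plus one custom evaluation that returns a label; the field names mirror the lists above, everything else is illustrative:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenerationRow:
    # One row per generation event
    input: str
    output: str
    context: Optional[str]   # retrieved context; may not exist
    reference: str           # ground truth


@dataclass
class RoutingRow:
    # One row per routing / tool-call event
    tool_called: str
    arguments: dict
    next_step: str           # the NEXT step the Agent chose
    expected: str            # the route/step we expected


def routing_label(row: RoutingRow) -> str:
    # A custom evaluation returning a meaningful label rather than a numeric score
    return "correct_route" if row.tool_called == row.expected else "misrouted"
```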

We will have a confusion matrix (context may or may not exist):

INPUT - OUTPUT - CONTEXT - REFERENCE

We can then work out an evaluation.

We look for OMISSIONS - ADDITIONS - CONTRADICTIONS - COMPLETENESS as alternatives to traditional F1 scores, although those can be computed as well.
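
One way to do this is an LLM judge that compares OUTPUT against REFERENCE and returns labels rather than a number. A sketch, where the prompt wording and the call_llm helper are assumptions, not part of the case-study code:

```python
import json

JUDGE_PROMPT = """Compare the OUTPUT to the REFERENCE.
Return JSON with four fields:
  "omissions": facts in the REFERENCE that are missing from the OUTPUT,
  "additions": claims in the OUTPUT that are not supported by the REFERENCE,
  "contradictions": claims in the OUTPUT that conflict with the REFERENCE,
  "completeness": one of "complete", "partial", "incomplete".

REFERENCE:
{reference}

OUTPUT:
{output}
"""


def judge(output: str, reference: str, call_llm) -> dict:
    # call_llm is any function that takes a prompt string and returns the model's text
    raw = call_llm(JUDGE_PROMPT.format(reference=reference, output=output))
    return json.loads(raw)  # e.g. {"omissions": [...], ..., "completeness": "partial"}
```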

We can also evaluate system fails:

[image: system-fails]

Frictionless

The process needs to be frictionless for developers.

At an accessibility talk, the speaker asked, 'How do you make a blueberry muffin?'

You put the blueberries in at the beginning - you don't stuff them in at the end.

There are many excellent observability and evaluation platforms.

In this manual, we use a more DIY approach that achieves effectively the same thing: it creates observability by making Agents/Tools self-reflecting, and we then use a range of platform libraries, alongside our own custom evaluations, to carry out the evals.

In essence, we save, on every 'event' - an LLM/Tool call - the information necessary for evals and monitoring.

We can therefore do evals and monitoring in production, in real time and in front of the client, by giving the client access to a client-focused dashboard.

Log

The core trace record is datetime|component|model|temperature|input|output.

Additional data can be added optionally in a structured way.
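
A minimal logger for this trace format might look like the sketch below; the log_event name and the optional extra field are assumptions, while the file name reuses article_writer.csv from the case study:

```python
import csv
import json
from datetime import datetime, timezone

LOG_FILE = "article_writer.csv"  # we can have just one log file


def log_event(component: str, model: str, temperature: float,
              input_text: str, output_text: str, **extra) -> None:
    """Append one trace row: datetime|component|model|temperature|input|output[|extra-json]."""
    row = [datetime.now(timezone.utc).isoformat(), component, model, temperature,
           input_text, output_text]
    if extra:
        row.append(json.dumps(extra))  # optional structured additions
    with open(LOG_FILE, "a", newline="", encoding="utf-8") as f:
        # the csv module quotes any field containing the pipe delimiter or newlines
        csv.writer(f, delimiter="|").writerow(row)


# Every LLM/Tool call ("event") logs itself, for example:
# log_event("system_grader", "gpt-4o", 0.0, draft, verdict, expected="publish")
```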

This supports testing, evaluation and monitoring from development through to production.

The developer and QA work towards making each UNIT fully introspective.

Having researched the various observability tools, I think this logging approach is the simplest and possibly the most effective: it separates data collection from data evaluation, and it avoids being tied to any framework.