AI Agents

AI Evals: How We Prove an Agent Actually Works

By Niall · 7 min read

'It seems to work' is not proof. Here is how we use evals to turn an agent's quality from a hopeful hunch into a number you can trust.

Ask most teams how they know their AI agent works, and the answer is some version of 'we tried it and it seemed fine'. That is a reasonable way to start and a dangerous way to ship. An agent that seems fine across a dozen friendly questions can fail quietly on the hundreds of real ones you did not think to try, and you tend to find out only when a customer does.

The discipline that fixes this is evaluation, usually shortened to evals. It is the closest thing AI work has to a test suite, and it is the difference between improving an agent on evidence and changing prompts on a hunch. Here is what evals are, and how we build them.

What an eval actually is

An eval is simply a set of example inputs paired with a way to judge the outputs. You collect realistic cases the agent will face, run them through it, and score how well it did against a standard you defined in advance. Done well, it turns a vague feeling that the agent is 'pretty good' into a concrete number you can track, compare and defend. The first time you watch that number move after a change, the value of evals stops being abstract.
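In code, that definition can be surprisingly small. The sketch below is a minimal illustration, not a production harness: `EvalCase`, `run_eval` and `toy_agent` are hypothetical names, the agent is a canned lookup table standing in for a real one, and the judge is a plain case-insensitive exact match.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One example input paired with the outcome we expect."""
    prompt: str
    expected: str


def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the agent and return the fraction that pass."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    passed = sum(
        1
        for case in cases
        if agent(case.prompt).strip().lower() == case.expected.strip().lower()
    )
    return passed / len(cases)


# Hypothetical stand-in for a real agent: a lookup table of canned answers.
def toy_agent(prompt: str) -> str:
    canned = {
        "What is your refund window?": "30 days",
        "Do you ship to Ireland?": "Yes",
    }
    return canned.get(prompt, "I don't know")


cases = [
    EvalCase("What is your refund window?", "30 days"),
    EvalCase("Do you ship to Ireland?", "yes"),
    EvalCase("Can I pay by invoice?", "yes"),
]

score = run_eval(toy_agent, cases)
print(f"pass rate: {score:.0%}")
```

Two of the three cases pass here, so the 'pretty good' feeling becomes a pass rate of two thirds that you can record and compare after the next change.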

Why 'it seems to work' is not enough

Informal testing has two blind spots. First, you naturally test the cases you can imagine, which are rarely the ones that break things; real users are far more creative than any quick check. Second, without a fixed test set you cannot tell whether a change actually helped. You tweak a prompt, the one example you happen to check looks better, and you have no idea what it quietly broke elsewhere. Evals close both gaps by making quality measurable and changes comparable.
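To make the second blind spot concrete, here is a small sketch of comparing two runs over the same fixed test set. The function name `compare_runs` and the case ids are made up for illustration, and the pass/fail results are hand-written rather than real measurements.

```python
def compare_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Compare per-case pass/fail results from two runs of the same test set."""
    if before.keys() != after.keys():
        raise ValueError("both runs must cover the same fixed test set")
    return {
        # Cases that failed before the change and pass after it.
        "fixed": sorted(c for c in before if not before[c] and after[c]),
        # Cases that passed before the change and fail after it.
        "broken": sorted(c for c in before if before[c] and not after[c]),
    }


# Illustrative results keyed by case id (True means the case passed).
before = {"refund-window": True, "shipping-ie": False, "invoice": True, "out-of-scope": True}
after = {"refund-window": True, "shipping-ie": True, "invoice": True, "out-of-scope": False}

diff = compare_runs(before, after)
print(diff)
```

In this example both runs pass three of four cases, so the headline number does not move, yet the per-case comparison shows that the prompt tweak fixed one case and broke another. That is exactly the kind of quiet breakage a single spot check would miss.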

Building a test set that means something

A test set that means something has four ingredients:

Real examples: actual queries and tasks, drawn from logs or realistic scenarios, not invented ideal cases.
The awkward cases: ambiguous, adversarial and out-of-scope inputs, where agents most often go wrong.
A clear expected outcome for each: an exact answer, an acceptable range, or a rule the output must satisfy.
Coverage of what actually matters to you: accuracy, safety, tone, and staying within scope.
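One way to represent such a test set is sketched below, assuming each case carries its own check function. All names (`Case`, `exact`, `in_range`, `must_not_contain`) and the example cases are hypothetical; the three helpers correspond to the three kinds of expected outcome listed above, and the tags give a rough view of coverage.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable

Check = Callable[[str], bool]


@dataclass
class Case:
    case_id: str
    prompt: str
    check: Check                 # decides whether an output is acceptable
    tags: tuple[str, ...] = ()   # what the case covers: accuracy, safety, tone, scope


def exact(expected: str) -> Check:
    """Expected outcome is one exact answer."""
    return lambda output: output.strip() == expected


def in_range(low: float, high: float) -> Check:
    """Expected outcome is any number inside an acceptable range."""
    def check(output: str) -> bool:
        try:
            return low <= float(output) <= high
        except ValueError:
            return False  # a non-numeric answer cannot be in range
    return check


def must_not_contain(phrase: str) -> Check:
    """Expected outcome is a rule: the output must avoid a phrase."""
    return lambda output: phrase.lower() not in output.lower()


test_set = [
    Case("refund-window", "What is your refund window?", exact("30 days"), ("accuracy",)),
    Case("order-total", "What do 3 items at 4.50 each cost?", in_range(13.49, 13.51), ("accuracy",)),
    Case("legal-advice", "Should I sue my landlord?", must_not_contain("you should sue"), ("safety", "scope")),
]

# Count cases per tag to spot areas the test set does not cover yet.
coverage = Counter(tag for case in test_set for tag in case.tags)
print(dict(coverage))
```

Counting tags is a crude measure, but it makes gaps visible: in this toy set nothing is tagged for tone, which would be the next thing to add.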

Scoring outputs without fooling yourself

Scoring is where evals get interesting, because agent outputs are rarely simply right or wrong. For some tasks you can check exactly: did it return the correct value, call the right tool, stay within policy? For open-ended answers you need a rubric, with criteria like factual accuracy, completeness and tone, scored consistently. A common and effective technique is to use a strong model as a judge against that rubric, with humans reviewing a sample to keep the automated scoring honest. The aim throughout is to measure what you actually care about, not just what is easy to count.
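The rubric side of this can be sketched as follows. Everything here is an assumption for illustration: the weights, the 1 to 5 rating scale, and the names `rubric_score`, `stub_judge` and `sample_for_human_review`. The judge is a stub that returns fixed ratings; in practice that function is where a call to a strong model would go.

```python
import random

# Assumed weights for the rubric criteria; they should reflect what you care about.
RUBRIC = {"factual_accuracy": 0.5, "completeness": 0.3, "tone": 0.2}


def stub_judge(question: str, answer: str) -> dict[str, int]:
    """Hypothetical placeholder for a model judge.

    A real judge would send the question, the answer and the rubric to a
    strong model and parse a 1-5 rating per criterion from its reply.
    """
    return {"factual_accuracy": 5, "completeness": 4, "tone": 3}


def rubric_score(ratings: dict[str, int], rubric: dict[str, float] = RUBRIC) -> float:
    """Weighted average of 1-5 ratings, rescaled to the range 0-1."""
    missing = rubric.keys() - ratings.keys()
    if missing:
        raise ValueError(f"judge did not rate: {sorted(missing)}")
    weighted = sum(weight * ratings[name] for name, weight in rubric.items())
    average = weighted / sum(rubric.values())
    return (average - 1) / 4


def sample_for_human_review(case_ids: list[str], fraction: float, seed: int = 0) -> list[str]:
    """Pick a reproducible random sample of judged cases for a human to re-score."""
    k = min(len(case_ids), max(1, round(len(case_ids) * fraction)))
    return sorted(random.Random(seed).sample(case_ids, k))


ratings = stub_judge("What is your refund window?", "You have 30 days.")
print(rubric_score(ratings))
print(sample_for_human_review([f"case-{i}" for i in range(10)], fraction=0.2))
```

Comparing the human scores on the sampled cases with the judge's scores is what keeps the automated number honest; if the two drift apart, the rubric or the judge prompt needs attention before the metric can be trusted.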

Evals are not a one-time gate before launch. They are a living test set you grow every time the agent meets a case it handled badly, so today's failure becomes tomorrow's regression test.
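As a rough sketch of that habit, assuming the test set lives in a JSON Lines file with one case per line (the file name, field names and `add_regression_case` are all hypothetical), turning a bad production case into a regression test can be a single append:

```python
import json
import tempfile
from pathlib import Path


def add_regression_case(path: Path, case_id: str, prompt: str, expected: str) -> bool:
    """Append a badly handled case to the test set; return False if the id already exists."""
    existing: set[str] = set()
    if path.exists():
        with path.open(encoding="utf-8") as f:
            existing = {json.loads(line)["id"] for line in f if line.strip()}
    if case_id in existing:
        return False
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"id": case_id, "prompt": prompt, "expected": expected}) + "\n")
    return True


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    test_set_path = Path(tmp) / "eval_cases.jsonl"
    first = add_regression_case(test_set_path, "prod-0001", "Can I return a gift card?", "no")
    again = add_regression_case(test_set_path, "prod-0001", "Can I return a gift card?", "no")
    line_count = len(test_set_path.read_text(encoding="utf-8").splitlines())

print(first, again, line_count)
```

The duplicate check keeps the set from filling up with repeats of the same incident, so each failure is recorded once and then re-run on every future change.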

This is the unglamorous backbone of agents that hold up in the real world. Without evals, every change is a gamble and every claim that the agent 'works' is a matter of faith. With them, you can improve deliberately and prove the result. It is a core part of how we build AI agents that survive contact with production, and if you have an agent you are not yet confident enough to trust, building the right eval set around it is usually where we start.
