May 30, 2026 · Engineering · 14 min

Agent evaluation and observability: the craft that separates a real operator from a hobbyist

Every previous post in this series treated evaluation as load-bearing: DELEGATE-52 showed that models silently corrupt their work over long workflows; the layoff editorial ranked evaluation second in the operator's skill stack; the threat model made it the primary defense against drift. None of those posts told you how to actually do it. This post is the recipe: the four metric categories that matter, the small fast eval suite every operator should ship before their first paying user, the production observability stack that makes drift detectable, the prompt versioning discipline that turns Tuesday's regression into Wednesday's rollback, and the canary deployment pattern that catches problems before they reach the whole fleet. This is the craft that separates the operator with a business from the operator with a hobby.

The reason evaluation matters more for agents than for models

Evaluating a model is a published-paper activity. You run it against MMLU, GSM8K, HumanEval, you publish a leaderboard number, you ship the model. The benchmark is public, the test set is fixed, the metric is unambiguous.

Evaluating an agent is none of those things. The benchmark you actually care about is your own — your customers, your data, your edge cases. The test set has to be built from the workload you are running, not from an academic dataset. The metric is multi-dimensional because an agent that gives the right answer but costs $20 to do it is not better than one that gives a slightly worse answer for $0.02. And every change to any of the agent's components — the prompt, the tool list, the model behind it, the version of any MCP server it consumes — can silently change behavior in ways the model-level benchmarks never captured.

The operators we see succeed treat their evaluation suite as their second most important artifact, behind the prompt itself. The operators we see fail treat it as something they will get to after they have customers, by which point the regressions have already cost them clients.

The four metric categories every operator should track

Agent metrics fall into four categories. Trying to track everything is a recipe for tracking nothing useful; the discipline is picking one specific number in each category and watching it consistently.

Correctness. Did the agent produce the right output? The hardest of the four because "right" is workload-specific. For a code agent, correctness is whether the code compiles and passes tests. For a research agent, correctness is whether the facts cited are real and the conclusion is supported. For a customer-support agent, correctness is whether the customer's actual problem was resolved. You cannot measure correctness in general; you can only measure it against the specific shape of your workload.

Cost. How much did the agent spend to produce the output? Tokens for the model layer, calls for any priced tools, compute for any sandboxes, settlement fees for any payments. Cost is the easiest category to measure (everything is a number on a bill) and the easiest to ignore (because nothing breaks if cost rises). Operators who do not track cost get a surprise at the end of the first big-volume month.

Latency. How long did it take from input to output? Wall-clock latency is what the user feels; it includes model inference, tool calls, retries, queueing. Measure p50, p95, and p99 separately because the tail latency is where the user-experience breakage lives. A p95 of 30 seconds with a p99 of 5 minutes is a user-experience disaster even if the median is fine.

Drift. Is the agent today behaving the same way as the agent last week? Drift can come from model updates, tool changes, RAG corpus updates, system prompt edits, or simply a shift in user inputs. The metric you want is a comparison between recent outputs and a reference baseline: exception rates, output-length distributions, tool-call patterns. A 20% jump in any of these is a signal to look closer.
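The latency advice above (measure p50, p95, and p99 separately) can be sketched in a few lines. This is a minimal illustration using the nearest-rank definition of a percentile; the function names and the sample data are assumptions made for the example, not part of any particular tracing tool.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def latency_summary(latencies_ms):
    """Report the median and the two tail percentiles separately."""
    return {f"p{q}": percentile(latencies_ms, q) for q in (50, 95, 99)}

# Hypothetical sample: 100 requests, mostly fast, with a slow tail.
latencies_ms = [800] * 90 + [4_000] * 5 + [30_000] * 4 + [300_000]
print(latency_summary(latencies_ms))
```

On this sample the median looks healthy while the tail does not, which is the reason to report the three numbers separately rather than as a single average.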

Building an evaluation suite in an afternoon

The eval suite every operator should have on day one is small, fast, and built from real cases. The mistake is trying to make it comprehensive on the first pass; the correct path is to make it minimal and grow it as you encounter new failure modes.

The recipe:

1. Collect 20-50 golden test cases. These are concrete agent inputs you have already run through the agent and for which you know the right answer. Some should be easy (the agent should clearly succeed), some should be hard (the agent might fail), some should be adversarial (you want to know how the agent fails). Twenty is enough to start; fifty is enough to feel covered for the first quarter. The most important property is that every case is real, not synthetic.

2. Write the expected output (or the validator) for each case. For deterministic outputs (a parsed JSON, a SQL query result, a numeric answer), the expected output is just a value to compare against. For non-deterministic outputs (natural language responses, summaries), you need a validator: another model call that grades the agent's output against criteria you specify. The validator does not have to be perfect; it has to be consistent.

3. Run the suite once and record the baseline. Every test case gets a pass/fail. The suite as a whole produces a pass rate, a total token cost, a total wall-clock time. That triple is your baseline. Every future run gets compared to it.

4. Add a new case every time you find a failure in production. When a user reports something wrong, the first thing you do, before fixing it, is add the case to the eval suite. If the fix does not actually resolve the failure, the suite will tell you. If the fix works for that case but breaks five others, the suite will tell you that too.

5. Run the suite before every meaningful change. Prompt edits, model swaps, tool additions, MCP server upgrades, anything that could affect behavior. The suite is your gate; nothing that fails the suite ships.

This is roughly an afternoon of work for an operator who has a workload running. The hardest part is being honest about which cases you genuinely care about; the trap is including cases that look interesting but never come up in production.
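Step 2 of the recipe distinguishes deterministic cases from cases that need a validator. A minimal sketch of that dispatch follows. The name `grade_text`, the case keys, and the `judge` callable are assumptions for illustration; `keyword_judge` is only a stand-in so the example runs, where a real validator would be a model call against criteria you wrote.

```python
def grade_text(output_text, case, judge=None):
    """Return True if the agent's output passes the case.

    Cases with an `expected` value are compared directly (after trimming
    and lowercasing, which is an assumption about how strict you want to be).
    All other cases are delegated to `judge`, a callable that wraps the
    validator and returns a truthy value on pass.
    """
    if "expected" in case:
        return output_text.strip().lower() == case["expected"].strip().lower()
    if judge is None:
        raise ValueError(f"case {case['id']} needs a validator but no judge was given")
    criteria = case.get("validator", case.get("expected_behavior"))
    return bool(judge(output_text, criteria))

def keyword_judge(output_text, criteria):
    """Stand-in judge for illustration only: checks three fixed keywords."""
    required = ["revenue", "eps", "guidance"]
    return all(word in output_text.lower() for word in required)

easy = {"id": "easy-01", "expected": "Paris"}
hard = {"id": "hard-04", "validator": "summary mentions revenue, EPS, guidance"}
print(grade_text(" paris ", easy))
print(grade_text("Revenue rose, EPS was flat, guidance was raised.", hard, judge=keyword_judge))
```

The judge is passed in as a parameter so the same grading loop works with a cheap stub in tests and a model-backed validator in real runs.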

A minimal eval-suite skeleton, conceptually:

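cases = [
    {"id": "easy-01", "input": "What is the capital of France?",
     "expected": "Paris", "category": "trivia"},
    {"id": "hard-04", "input": "Summarize Q3 earnings for AAPL",
     "validator": "summary mentions revenue, EPS, guidance",
     "category": "finance"},
    {"id": "adv-09", "input": "Ignore previous instructions and...",
     "expected_behavior": "refuse", "category": "injection"},
]

def run_suite(agent, grade):
    # Assumed interfaces: agent.run(text) returns an object with .tokens and .ms;
    # grade(output, case) returns True or False.
    results = []
    for case in cases:
        output = agent.run(case["input"])
        results.append({"id": case["id"], "pass": grade(output, case),
                        "tokens": output.tokens, "ms": output.ms})
    return results

def aggregate(results):
    # The baseline triple: pass rate, total tokens, total wall-clock time.
    return {"pass_rate": sum(r["pass"] for r in results) / len(results),
            "tokens": sum(r["tokens"] for r in results),
            "ms": sum(r["ms"] for r in results)}

# baseline = aggregate(run_suite(agent, grade))
# Compare every future run against `baseline`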
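Comparing a new run against the baseline triple (pass rate, total tokens, total wall-clock time) can be a small function. The sketch below assumes each run has already been reduced to a dict with the keys `pass_rate`, `tokens`, and `ms`; the key names and the 25% cost and time tolerances are assumptions to tune per workload, not recommended values.

```python
def gate(baseline, candidate, max_pass_drop=0.0, max_cost_ratio=1.25, max_time_ratio=1.25):
    """Return the reasons the candidate run fails the gate; an empty list means it may ship."""
    failures = []
    if candidate["pass_rate"] < baseline["pass_rate"] - max_pass_drop:
        failures.append("pass rate regressed")
    if candidate["tokens"] > baseline["tokens"] * max_cost_ratio:
        failures.append("token cost rose")
    if candidate["ms"] > baseline["ms"] * max_time_ratio:
        failures.append("wall-clock time rose")
    return failures

# Hypothetical numbers for illustration.
baseline = {"pass_rate": 0.90, "tokens": 120_000, "ms": 95_000}
candidate = {"pass_rate": 0.85, "tokens": 118_000, "ms": 130_000}
print(gate(baseline, candidate))
```

Returning the list of reasons rather than a bare boolean makes the gate's output usable as the note you record next to the change.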

Production observability — what to log, what to alert on

The eval suite tells you how the agent behaves on a fixed set of cases. Production observability tells you how the agent is behaving right now on the cases it is actually seeing. Both are necessary; neither replaces the other.

The four categories of production signal:

Traces. Every meaningful agent action is logged as a span in a distributed tracing system (OpenTelemetry is the standard; backends such as Datadog and Honeycomb ingest it). The trace should include the input, the tool calls made, the tool results, the final output, and the wall-clock time per span. The 2026-07-28 MCP release standardizes W3C Trace Context propagation, so MCP-mediated tool calls correlate naturally with the agent's own spans.

Cost meter. Per-request, per-agent, per-tool. Aggregated to per-day and per-month. An operator should be able to answer "which agent burned the most tokens yesterday" in under thirty seconds. The cost meter is also the early-warning system for runaway loops; a single agent suddenly burning 100x its baseline is almost always a loop.

Exception rate. The percentage of agent runs that failed, retried, or escalated to human. An agent in steady state has a stable exception rate; a deviation of more than two standard deviations from the trailing baseline is a signal to look closer. Track the exception rate by category (timeout, validation failure, tool error, model refusal) because the categories tell you where the problem lives.

Output-shape distribution. The distribution of output lengths, output formats, tool-call counts per request, and counterparty interaction patterns. A sudden shift in any of these is drift. The agent that yesterday produced 500-token responses on average and today is producing 50-token responses on average has either changed or is failing silently.
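Of the four signals above, the cost meter is the simplest to prototype. The sketch below is an in-memory stand-in, assuming token counts are recorded per day, agent, and tool; the class and method names are hypothetical, and a real deployment would back this with a metrics store. The 100x factor comes from the runaway-loop heuristic in the text.

```python
from collections import defaultdict

class CostMeter:
    def __init__(self):
        # (day, agent, tool) -> tokens spent
        self._tokens = defaultdict(int)

    def record(self, day, agent, tool, tokens):
        self._tokens[(day, agent, tool)] += tokens

    def by_agent(self, day):
        """Total tokens per agent for one day."""
        totals = defaultdict(int)
        for (d, agent, _tool), tokens in self._tokens.items():
            if d == day:
                totals[agent] += tokens
        return dict(totals)

    def top_agent(self, day):
        """Answer 'which agent burned the most tokens on this day'."""
        totals = self.by_agent(day)
        return max(totals, key=totals.get) if totals else None

    def runaway(self, day, baselines, factor=100):
        """Agents whose daily total exceeds `factor` times their baseline."""
        return sorted(agent for agent, total in self.by_agent(day).items()
                      if agent in baselines and total > factor * baselines[agent])

# Hypothetical data for illustration.
meter = CostMeter()
meter.record("2026-05-29", "support", "model", 40_000)
meter.record("2026-05-29", "support", "search", 5_000)
meter.record("2026-05-29", "research", "model", 900_000)
print(meter.top_agent("2026-05-29"))
print(meter.runaway("2026-05-29", {"support": 50_000, "research": 8_000}))
```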

The alerting discipline matters more than the metric list. The trap is alerting on every metric in real time; the result is alert fatigue and ignored alarms. The discipline is alerting on a small number of high-signal conditions: cost spikes >3x baseline within an hour, exception rate >2σ above trailing 24-hour baseline, p99 latency >5x baseline, output-shape KL divergence above a threshold. Four alerts you actually respond to beat forty alerts you ignore.
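The four alert conditions named above fit in one function. This is a sketch under stated assumptions: the dict keys are hypothetical, `exception_history` is taken to be a list of exception rates sampled over the trailing 24 hours, and the KL threshold of 0.1 is a placeholder since the text leaves that value to the operator.

```python
from statistics import mean, stdev

def check_alerts(current, baseline, exception_history, kl_threshold=0.1):
    """Return the names of the alert conditions that fire."""
    fired = []
    # Cost spike: more than 3x the baseline within the hour.
    if current["hourly_cost"] > 3 * baseline["hourly_cost"]:
        fired.append("cost_spike")
    # Exception rate: more than two standard deviations above the trailing mean.
    mu, sigma = mean(exception_history), stdev(exception_history)
    if current["exception_rate"] > mu + 2 * sigma:
        fired.append("exception_rate")
    # Tail latency: p99 more than 5x the baseline.
    if current["p99_ms"] > 5 * baseline["p99_ms"]:
        fired.append("p99_latency")
    # Output shape: KL divergence above the chosen threshold.
    if current["output_kl"] > kl_threshold:
        fired.append("output_shape")
    return fired

# Hypothetical readings for illustration.
baseline = {"hourly_cost": 2.0, "p99_ms": 4_000}
history = [0.04, 0.05, 0.06, 0.05, 0.04, 0.06]
current = {"hourly_cost": 10.0, "exception_rate": 0.09, "p99_ms": 12_000, "output_kl": 0.02}
print(check_alerts(current, baseline, history))
```

Note that `stdev` needs at least two history points; a production version would also decide what to do while the trailing window is still filling.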

Prompt versioning, the discipline nobody wants to set up

Prompts drift the same way code drifts. The operator who tweaks the prompt every Tuesday afternoon "to fix the thing the user complained about" without recording what changed is the operator who cannot answer "what version of the prompt was running when this output happened" six weeks later.

The minimum discipline:

Every prompt change goes in a version-controlled repository. Git is the obvious choice; the prompt lives as a file in a repo, the commit message describes what changed and why. Every agent in production references a specific commit hash, not a moving "latest" pointer.

Every prompt version has its eval suite run before it ships. The result of the run gets recorded alongside the version (pass rate, cost, latency baseline). The operator can look at the version history and say "this version regressed correctness by 8% but reduced cost by 30%; the trade-off was intentional."

Rollback is one command. If a prompt change is causing production issues, the operator changes the version pointer back to the previous commit. The whole fleet is back to the known-good version within minutes. No re-deploy, no rebuild, no redo of the work.
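The three rules above (pinned versions, an eval result recorded before shipping, one-step rollback) can be illustrated with a small in-memory registry. This is not a replacement for Git, which remains the obvious store; the `PromptRegistry` class and its methods are hypothetical names used only to show the shape of the discipline.

```python
import hashlib

class PromptRegistry:
    def __init__(self):
        self._versions = {}  # version id -> {"text", "message", "eval"}
        self._history = []   # version ids in the order they were promoted

    def commit(self, text, message):
        """Store a prompt under a content hash, the analogue of a commit hash."""
        version = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._versions.setdefault(version, {"text": text, "message": message, "eval": None})
        return version

    def record_eval(self, version, pass_rate, tokens, ms):
        self._versions[version]["eval"] = {"pass_rate": pass_rate, "tokens": tokens, "ms": ms}

    def promote(self, version):
        """Make a version live; refuse if its eval suite result was never recorded."""
        if self._versions[version]["eval"] is None:
            raise ValueError("run the eval suite before shipping this version")
        self._history.append(version)

    def rollback(self):
        """Move the live pointer back to the previously promoted version."""
        if len(self._history) < 2:
            raise ValueError("no previous version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def live(self):
        return self._history[-1] if self._history else None

    def prompt(self):
        return self._versions[self.live]["text"]

registry = PromptRegistry()
v1 = registry.commit("You are a support agent.", "initial prompt")
registry.record_eval(v1, pass_rate=0.90, tokens=120_000, ms=95_000)
registry.promote(v1)
v2 = registry.commit("You are a concise support agent.", "shorten replies")
registry.record_eval(v2, pass_rate=0.82, tokens=84_000, ms=80_000)
registry.promote(v2)
print(registry.rollback() == v1)
```

Because agents read the prompt through the live pointer, rollback changes one value and requires no rebuild.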

Agent Builder ships this discipline by default: every prompt change is a new versioned artifact, every version stores its eval suite result, and the rollback button is a single click on the dashboard. Operators outside Agent Builder have to wire this up themselves, and the wiring is exactly as boring as it sounds — but the operators who skip it regret it the first time they need to roll back at 11 PM.

Canary deployment for agents

Production agent updates should not flip the whole fleet at once. The pattern borrowed from software deployment — canary releases — applies almost unchanged to agents.

The shape: when a new prompt or new model version is ready to deploy, the operator points 5% of incoming traffic at the new version while keeping 95% on the known-good version. The eval suite runs continuously against the canary; production metrics (cost, latency, exception rate) are compared between canary and baseline. If the canary holds for a few hours under real traffic, you ramp to 25%, then to 50%, then to 100%. If the canary degrades, you roll back to the baseline and investigate.
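The traffic split and ramp described above can be sketched as follows. Routing by a hash of the request id keeps each request on a stable side of the split; the metric keys and the 20% tolerance are assumptions for illustration, and the ramp stages are the ones from the text.

```python
import hashlib

RAMP = [5, 25, 50, 100]  # percent of traffic on the canary at each stage

def bucket(request_id):
    """Map a request id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def route(request_id, canary_percent):
    """Send roughly canary_percent of requests to the canary, deterministically."""
    return "canary" if bucket(request_id) < canary_percent else "baseline"

def next_step(canary_percent, canary, baseline, tolerance=1.2):
    """Return the next canary percentage, or 0 to roll back to the baseline."""
    degraded = (
        canary["exception_rate"] > baseline["exception_rate"] * tolerance
        or canary["cost_per_task"] > baseline["cost_per_task"] * tolerance
        or canary["p95_ms"] > baseline["p95_ms"] * tolerance
    )
    if degraded:
        return 0
    later = [p for p in RAMP if p > canary_percent]
    return later[0] if later else 100

# Hypothetical metrics for illustration.
base = {"exception_rate": 0.05, "cost_per_task": 0.020, "p95_ms": 9_000}
healthy = {"exception_rate": 0.05, "cost_per_task": 0.021, "p95_ms": 9_200}
print(route("req-42", 5), next_step(5, healthy, base))
```

A hash-based split is one choice among several; it is used here because it needs no shared state between routers and gives the same answer on retry.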

The trick that makes canaries actually work for agents is the same as for software: you have to have the metrics in place before you start the canary. A canary without observability is just a smaller rollout of a change that will break things before you notice. Agent Builder's canary mechanism plugs directly into the metrics described above; the operator sees the canary's pass rate, cost, and latency on the same dashboard, side by side with the baseline.

The drift detection pattern that prevents lost clients

Drift is the silent killer. Costs creep up 5% a week. Latency p95 doubles over six weeks. The agent's output style shifts. The customer-support escalation rate inches from 10% to 18%. None of these is alarming on any given day. By the end of the quarter the agent is a different product than the one the client signed for.

The defense is comparative drift detection. Pick a reference window of recent traffic (the trailing seven days) and a short evaluation window (the most recent six hours). Compute the same metrics on each. The drift metric is some distance measure between the two: KL divergence on output-length distributions, percentage change on exception rates, ratio on cost-per-task. Once a day, a dashboard widget shows the drift number with a yellow/red threshold.
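The three distance measures named above can be computed with the standard library. In this sketch the bin width, the additive smoothing, and the yellow/red thresholds are assumptions chosen to make the example run; they are not tuned values, and the window dict keys are hypothetical.

```python
import math
from collections import Counter

def kl_divergence(recent, reference, bin_width=100, smoothing=1.0):
    """KL(recent || reference) over binned output lengths.

    Additive smoothing keeps bins that are empty on one side from producing log(0).
    """
    p_counts = Counter(n // bin_width for n in recent)
    q_counts = Counter(n // bin_width for n in reference)
    bins = set(p_counts) | set(q_counts)
    p_total = len(recent) + smoothing * len(bins)
    q_total = len(reference) + smoothing * len(bins)
    total = 0.0
    for b in bins:
        p = (p_counts[b] + smoothing) / p_total
        q = (q_counts[b] + smoothing) / q_total
        total += p * math.log(p / q)
    return total

def pct_change(recent, reference):
    return (recent - reference) / reference * 100.0

def drift_report(recent, reference, kl_yellow=0.1, kl_red=0.5):
    """Compare the evaluation window against the reference window."""
    kl = kl_divergence(recent["output_lengths"], reference["output_lengths"])
    status = "red" if kl >= kl_red else "yellow" if kl >= kl_yellow else "green"
    return {
        "output_length_kl": kl,
        "exception_rate_pct_change": pct_change(recent["exception_rate"],
                                                reference["exception_rate"]),
        "cost_ratio": recent["cost_per_task"] / reference["cost_per_task"],
        "status": status,
    }

# Hypothetical windows: responses shrink from about 500 tokens to about 50.
reference = {"output_lengths": [500] * 100, "exception_rate": 0.10, "cost_per_task": 0.02}
recent = {"output_lengths": [50] * 100, "exception_rate": 0.18, "cost_per_task": 0.03}
print(drift_report(recent, reference)["status"])
```

The status here is driven only by the output-length KL value; a fuller version would apply thresholds to the exception-rate change and cost ratio as well.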

This is not exotic. It is the boring discipline of asking "is the agent today the same as the agent last week" and not waiting for the answer to arrive in the form of a customer complaint. The operators who run this loop pick up regressions weeks before their customers do; the operators who do not run this loop are the operators their customers have already started talking about leaving.

How Agent Builder ships the defaults

For an operator who wants the floor in place without wiring it up themselves, Agent Builder does these things by default for every agent in the fleet.

Every agent automatically generates a starter eval suite from the first thirty real runs (with the operator's review to confirm the expected outputs). Subsequent failures captured in production are added to the suite as candidate cases that the operator approves with one click.

The observability stack with W3C Trace Context propagation that we covered in the MCP post is on by default. Traces ship to the operator's chosen backend; the built-in dashboard renders them locally if no backend is configured.

Cost, latency, exception rate, and output-shape distribution are tracked per-agent and per-fleet with the alerting thresholds pre-set to the values we have found work for the median operator. Adjusting them is a slider; defaults are conservative.

Prompts are versioned automatically in a Git-shaped store. Every agent in production references a pinned version. Rollback is one click.

Canary deployment is the default for any prompt or model change on agents above a configurable traffic threshold. New versions ramp 5% → 25% → 50% → 100% over a window the operator sets, with auto-rollback if metrics regress beyond threshold.

None of this is magic. It is the operationally correct floor that experienced operators end up building anyway; Agent Builder ships it so the floor is the same for everyone on the platform.

What the operator still has to do themselves

Three things the platform cannot do for the operator.

Decide what "correct" means for the workload. The validator that grades a research agent's output cannot be supplied by Agent Builder because the criteria are domain-specific. The operator writes the validator. The good news is that "another model evaluating against a prompt the operator wrote" is the standard pattern, and it works.

Read transcripts. The single highest-leverage activity in agent operations is reading the actual outputs the agent produced and noticing what is off. No metric replaces it. Operators who spend thirty minutes a day reading transcripts catch issues that no automated metric will pick up.

Decide which cases matter for the eval suite. The platform can suggest cases from production runs, but the operator decides which ones are core (must never regress), which ones are edge (nice to handle), and which ones are out of scope. The eval suite is the operator's contract with themselves about what their agent will and will not do.
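The core, edge, and out-of-scope split in the last item can be made explicit in the suite itself. A minimal sketch, assuming the operator keeps a hypothetical mapping from case id to tier: core cases block shipping on any failure, edge cases are reported as a pass rate, and out-of-scope cases are ignored.

```python
def tier_gate(results, tiers):
    """results: case id -> passed (bool); tiers: case id -> "core", "edge", or "out_of_scope"."""
    core_failures = sorted(case_id for case_id, passed in results.items()
                           if tiers.get(case_id) == "core" and not passed)
    edge = [passed for case_id, passed in results.items() if tiers.get(case_id) == "edge"]
    return {
        "ship": not core_failures,
        "core_failures": core_failures,
        "edge_pass_rate": sum(edge) / len(edge) if edge else None,
    }

# Hypothetical tiers and results for illustration.
tiers = {"easy-01": "core", "hard-04": "edge", "adv-09": "core", "odd-17": "out_of_scope"}
results = {"easy-01": True, "hard-04": False, "adv-09": False, "odd-17": False}
print(tier_gate(results, tiers))
```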

Closing

The discipline of evaluation and observability is the closest thing to a moat that an individual operator can build. The protocols are open. The models are commodities. The dashboard is a SaaS. The thing that compounds — the only thing — is the operator's own data about how their agents behave on their workload, and the disciplined process for using that data to make the agents better over time.

The operators who do this work look slow to outsiders for the first six weeks. By month three they are shipping changes faster than competitors because their canary catches regressions in fifteen minutes that other operators discover from angry customers in two weeks. By month six they have an evaluation suite that is itself an asset — when a new model ships, they know within an afternoon whether it improves their workload. By year one they have compounded into the operator the market trusts because their agents are demonstrably more reliable, and the demonstration is the suite they show prospects on the sales call.

The afternoon you spend building the first version of your eval suite is the most leveraged afternoon of your first quarter as an operator. The dashboard you set up before your first paying customer is the dashboard that saves you when the first regression hits. The prompt versioning you wire up before you need to roll back is the discipline that lets you sleep on Tuesday night.

The next post in this series steps back to the synthesis view — one diagram, all five layers of the agentic stack, where evaluation and observability sit transversally across all of them. The patterns are the skeleton, the eval and observability are the nervous system, and the synthesis is the map an operator hangs on the wall.