The Eval That Passed and Still Failed
The eval suite was green on Friday. On Monday, a clinician reported that the assistant missed an authorization edge case everyone knew was painful. The suite had stayed green because it had never contained the workflow that mattered.
Why Demo Prompts Are Politically Clean
Teams ask stakeholders for ten example prompts during kickoff and call the result an eval set. Those prompts are usually clean, direct, and politically safe. They test whether the demo narrative still works, not whether the production system survives ambiguity, missing data, permission denial, or partial tool failure.
The Workflow the Suite Did Not Know
A healthtech assistant answered prior authorization questions. Its evals were hand-written examples from the pilot demo. A prompt change improved tone and broke an edge case where payer timeout, missing clinical code, and consent status interacted. The regression was caught only after production users complained. The team rebuilt evals from de-identified ticket exports and support transcripts. Each case included expected source usage, tool calls, refusal behavior, and review routing. A later prompt change failed CI because it selected the wrong payer policy version.
Golden Traces Beat Golden Answers
Demo eval:
  Stakeholder prompts --> final answer score --> green
Production eval:
  De-identified workflow --> expected sources
                         --> expected tool trace
                         --> policy expectation
                         --> outcome rubric
                         --> regression gate
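The production path in the diagram can be read as a case record that is scored on evidence and tool trace, not only on final text. The following is a minimal sketch under that reading, not the team's actual harness. `GoldenCase`, `RunRecord`, `check_case`, and every source and tool name in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """Expected behavior for one de-identified workflow (hypothetical schema)."""
    case_id: str
    expected_sources: frozenset[str]      # sources the answer must draw on
    expected_tool_trace: tuple[str, ...]  # tool names, in order
    expected_disposition: str             # e.g. "answer", "refuse", "escalate"


@dataclass(frozen=True)
class RunRecord:
    """What the system actually did on that workflow."""
    sources_used: frozenset[str]
    tool_trace: tuple[str, ...]
    disposition: str


def check_case(case: GoldenCase, run: RunRecord) -> list[str]:
    """Return failure reasons; an empty list means the case passes."""
    failures = []
    missing = case.expected_sources - run.sources_used
    if missing:
        failures.append(f"missing sources: {sorted(missing)}")
    if run.tool_trace != case.expected_tool_trace:
        failures.append(
            f"tool trace {list(run.tool_trace)} != expected {list(case.expected_tool_trace)}"
        )
    if run.disposition != case.expected_disposition:
        failures.append(
            f"disposition {run.disposition!r} != expected {case.expected_disposition!r}"
        )
    return failures
```

A run that cites an older payer policy version than the case expects fails on the source check even if its final answer reads well, which is the kind of regression a final-answer score alone would miss.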
Version the Things That Regress
A serious eval set has layers. Offline evals run before deployment and compare expected evidence, tool traces, and outcomes. Online monitors watch drift, latency, refusal rate, escalation rate, and policy blocks. Human review queues adjudicate cases where ground truth is not binary. Versioning matters: prompt version, model version, retrieval index version, tool schema version, and policy version should all appear in eval results.
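To make the versioning point concrete, here is a small sketch of a version stamp attached to every eval result, plus a helper that names what changed between two runs. The class, field, and function names are illustrative assumptions, not an established schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VersionStamp:
    """The five versions the text says should appear in eval results."""
    prompt: str
    model: str
    retrieval_index: str
    tool_schema: str
    policy: str


def changed_components(baseline: VersionStamp, candidate: VersionStamp) -> list[str]:
    """Name the components whose version differs between two eval runs."""
    before, after = asdict(baseline), asdict(candidate)
    # asdict preserves field declaration order, so the output order is stable.
    return [name for name in before if before[name] != after[name]]
```

When a case passes on the baseline and fails on the candidate, the list of changed components narrows the search before anyone reads a transcript.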
For workflow evals, the release review should inspect the workflow before it inspects the model. Every production use case needs a task boundary, identity model, allowed tool list, context source registry, policy version, trace format, and rollback path. The architecture should say which actions are read-only, which create drafts, which require approval, and which are blocked entirely.
Workflow evals need evidence, not confidence. Show a sample tool-call trace with user identity, tenant, policy decision, input redaction, output payload, cost, latency, and final action. Show how an incident commander disables one tool class without disabling the whole assistant. Show how prompt, model, retrieval index, and tool schema changes each move through regression gates.
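Disabling one tool class without disabling the assistant implies that tools are grouped and checked at call time. A minimal sketch of that idea follows. `ToolRegistry`, its methods, and the tool and class names are hypothetical, and a real system would also need persistence and an audit log for each switch.

```python
class ToolRegistry:
    """Maps tools to classes so one class can be switched off at runtime."""

    def __init__(self, tool_classes: dict[str, str]):
        self._tool_classes = dict(tool_classes)  # tool name -> class name
        self._disabled: set[str] = set()

    def disable_class(self, tool_class: str) -> None:
        """Kill switch for every tool in one class."""
        self._disabled.add(tool_class)

    def enable_class(self, tool_class: str) -> None:
        """Re-enable a class after the incident is resolved."""
        self._disabled.discard(tool_class)

    def is_allowed(self, tool: str) -> bool:
        # Unknown tools are denied, so a missing registration fails closed.
        tool_class = self._tool_classes.get(tool)
        return tool_class is not None and tool_class not in self._disabled
```

For example, disabling a hypothetical "write" class blocks submission tools while read-only lookups keep working.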
For workflow evals, the hard question is not whether the agent can complete the demo. The hard question is whether the system can explain what happened after a wrong answer, stale context, duplicate tool call, or permission denial.
When Human Review Belongs in the Loop
LLM-as-judge can help with tone or answer completeness, but it should not be the only gate for regulated or side-effecting work. Golden traces are stronger when the task has a known workflow. Human review is slower but necessary for high-risk cases where policy interpretation or clinical context matters. The right eval program uses all three deliberately.
What Eval Discipline Costs
The honest tradeoff is not speed versus safety in the abstract. It is which actions deserve autonomy, which actions deserve draft mode, and which actions should never be delegated. The team should add control where the action changes customer data, money, access, or regulated records, then keep low-risk retrieval and drafting lightweight enough to keep learning.
Workflow eval tests should include empty retrieval, wrong-tenant retrieval, prompt injection through retrieved documents, stale index versions, duplicate tool retries, partial tool completion, malformed tool output, permission denial, long-session memory, and cost spikes. The expected result is not always a better answer. Sometimes the expected result is refusal, escalation, draft-only mode, or tool disablement.
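The list of failure scenarios above can double as a coverage check on the eval suite itself. The sketch below assumes each case is a dict with `scenario` and `expected_disposition` keys. The identifiers are invented labels for the scenarios and outcomes named in the text, with "answer" added for the ordinary case.

```python
REQUIRED_SCENARIOS = (
    "empty_retrieval",
    "wrong_tenant_retrieval",
    "prompt_injection_via_retrieved_document",
    "stale_index_version",
    "duplicate_tool_retry",
    "partial_tool_completion",
    "malformed_tool_output",
    "permission_denial",
    "long_session_memory",
    "cost_spike",
)

# The expected result is not always an answer.
ALLOWED_DISPOSITIONS = {"answer", "refuse", "escalate", "draft_only", "disable_tool"}


def coverage_gaps(cases: list[dict]) -> dict[str, list[str]]:
    """Report scenarios with no case and cases with an unknown disposition."""
    covered = {case["scenario"] for case in cases}
    missing = [s for s in REQUIRED_SCENARIOS if s not in covered]
    invalid = [
        case["scenario"]
        for case in cases
        if case["expected_disposition"] not in ALLOWED_DISPOSITIONS
    ]
    return {"missing_scenarios": missing, "invalid_dispositions": invalid}
```

Running a check like this in CI keeps the suite from quietly drifting back toward happy-path prompts.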
The CI Gate for Autonomy
For workflow evals, a useful artifact shows allowed actions by task, not only prompts by intent. Include the user role, autonomy tier, eligible context sources, tool list, approval rule, trace fields, kill switch, and rollback owner. That record is more valuable than a prompt library because it says what the system may do when the answer becomes an action.
What Makes the Eval Review Credible
A weak workflow-eval review asks whether the agent answered correctly in a demo. A useful review asks whether the system can prove why it answered, what it was allowed to touch, and how the team can stop it safely. The reviewer should ask for a golden workflow set with expected tool traces, not only expected final text. A case should specify which tools may be called, which sources are eligible, what refusal looks like, what approval state is required, and what audit fields must be written.
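A case specification of the kind described above can be checked mechanically against a recorded run. The following sketch treats tools and sources as allow-lists and audit fields as required keys. The dict layout and all key names are assumptions made for illustration.

```python
def check_run_against_spec(spec: dict, run: dict) -> list[str]:
    """Return violations of a case spec; an empty list means the run conforms."""
    violations = []
    for tool in run["tool_calls"]:
        if tool not in spec["allowed_tools"]:
            violations.append(f"tool not allowed: {tool}")
    for source in run["sources"]:
        if source not in spec["eligible_sources"]:
            violations.append(f"source not eligible: {source}")
    if run["approval_state"] != spec["required_approval_state"]:
        violations.append(
            f"approval state {run['approval_state']!r} "
            f"!= required {spec['required_approval_state']!r}"
        )
    for field in spec["required_audit_fields"]:
        if field not in run["audit_record"]:
            violations.append(f"missing audit field: {field}")
    return violations
```

Unlike an exact trace match, an allow-list tolerates harmless variation in call order while still catching a tool or source the case never permitted.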
A workflow-eval review should include rollout mechanics. Prompt changes, model route changes, retrieval index rebuilds, and tool schema changes should move through separate versioned gates because they fail differently. A model upgrade can change reasoning. A retrieval rebuild can change evidence. A tool schema change can change side effects. Treating all of those as one release type is how regressions hide.
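One way to keep those release types separate is to tag every change and derive the required checks from the tag. In the sketch below, the pairing of change types to checks is an assumption that loosely follows the paragraph above (reasoning, evidence, side effects). The check names are hypothetical.

```python
from enum import Enum


class ChangeType(Enum):
    PROMPT = "prompt"
    MODEL_ROUTE = "model_route"
    RETRIEVAL_INDEX = "retrieval_index"
    TOOL_SCHEMA = "tool_schema"


# Assumed mapping: each change type triggers the checks for what it can break.
GATE_CHECKS = {
    ChangeType.PROMPT: ("outcome_rubric", "tool_trace"),
    ChangeType.MODEL_ROUTE: ("outcome_rubric", "tool_trace"),
    ChangeType.RETRIEVAL_INDEX: ("expected_sources",),
    ChangeType.TOOL_SCHEMA: ("tool_trace", "side_effects"),
}


def checks_for_release(changes: set[ChangeType]) -> list[str]:
    """Union of the checks required by every change type in a release."""
    required: set[str] = set()
    for change in changes:
        required.update(GATE_CHECKS[change])
    return sorted(required)  # sorted for stable, diff-friendly output
```

A release that bundles several change types still records each one, so a failure can be traced to the gate that should have caught it.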
For workflow evals, cost and latency should be first-class signals. An agent that takes eight tool calls to resolve a low-value task may be correct and still not production-worthy. Track cost per completed workflow, timeout rate, approval queue age, refusal quality, and human override rate. Those numbers tell leadership whether the agent is becoming operational software or a permanent demo with nicer logs.
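Several of these signals are simple ratios over recorded runs. The sketch below computes three of them. `WorkflowRun` and `summarize` are hypothetical names, and charging the cost of failed runs to the completed ones is a deliberate assumption rather than a standard definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowRun:
    cost_usd: float
    completed: bool
    timed_out: bool
    human_override: bool


def summarize(runs: list[WorkflowRun]) -> dict[str, float]:
    """Compute cost per completed workflow, timeout rate, and override rate."""
    if not runs:
        raise ValueError("no runs to summarize")
    total = len(runs)
    completed = sum(run.completed for run in runs)
    total_cost = sum(run.cost_usd for run in runs)
    return {
        # Failed runs still cost money, so their spend counts against completions.
        "cost_per_completed_workflow": total_cost / completed if completed else float("inf"),
        "timeout_rate": sum(run.timed_out for run in runs) / total,
        "human_override_rate": sum(run.human_override for run in runs) / total,
    }
```

Dividing total spend by completed workflows makes retries and abandoned runs visible in the headline number instead of hiding them in an average per call.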
Signals Worth Watching
Leadership should watch workflow-eval signals: tool calls by autonomy tier, approval queue age, refusal quality, policy-block rate, cost per completed workflow, and the time it takes to disable one tool without disabling the whole assistant. Those numbers reveal whether the agent is becoming software or staying a guided demo.
The Artifact: Workflow Eval Record
The artifact worth keeping for workflow evals is a workflow control record. It should show the user role, allowed tools, autonomy tier, context sources, retrieval filters, approval state, trace retention rule, kill switch, and rollback owner. A prompt alone is not an artifact because it cannot prove authorization or side effects.
A practical workflow-eval review should include one sample trace from a real-shaped task. The trace should show identity, tenant, prompt version, retrieval index version, selected sources and their versions, tool calls and inputs, policy decision, approval state, cost, latency, and final disposition. If the team cannot produce that trace, it is not ready to scale autonomy. If the trace cannot explain a wrong answer or a blocked action, the eval suite is not yet a release gate.
Eval cases should also capture partial completion. The model may draft a response after one tool failed, or execute a side effect before another tool times out. The eval should specify whether compensation, approval, fail-closed, or resume-from-state is the expected behavior. Without that, teams optimize for fluent answers while production learns about unsafe persistence.
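A partial-completion case can state its expected recovery explicitly and report which side effects were already committed when a step failed. The sketch below assumes each trace step is a dict with `tool`, `status`, and `side_effect` keys. Those names, the `Recovery` enum, and the tool names in the example are hypothetical.

```python
from enum import Enum


class Recovery(Enum):
    COMPENSATE = "compensate"
    REQUIRE_APPROVAL = "require_approval"
    FAIL_CLOSED = "fail_closed"
    RESUME_FROM_STATE = "resume_from_state"


def committed_before_failure(steps: list[dict]) -> list[str]:
    """Side-effecting tools that succeeded before the first failed step."""
    committed = []
    for step in steps:
        if step["status"] != "ok":
            return committed
        if step["side_effect"]:
            committed.append(step["tool"])
    return []  # every step succeeded, so nothing is stranded


def evaluate_partial(steps: list[dict], expected: Recovery, observed: Recovery) -> dict:
    """Pass only if the observed recovery matches the one the case specifies."""
    return {
        "stranded_side_effects": committed_before_failure(steps),
        "passed": observed is expected,
    }
```

For instance, a trace where a hypothetical `create_draft` succeeds and a later `payer_lookup` times out reports `create_draft` as stranded, and the case fails unless the system took the recovery path the case named.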
The Rule for Eval Sets
An eval set should represent the work, not the sales demo. If it cannot catch a bad tool call, wrong source, or policy violation, it is not a production gate.
