A Chat UI Is Not the System
A production agent is a distributed system with a model inside it. Treat it as a chat feature and the first serious incident will be about identity, tools, retries, memory, or rollback.
The Missing Layer Is Not Prompting
Teams assume the agent is the model plus tool descriptions. That works in a demo with curated prompts and admin credentials. Production needs workflow state, scoped credentials, policy, idempotency, evaluation, and kill switches.
A senior review of a production agent design should ask which autonomy decision is being made, what evidence supports it, and what signal would force the team to pause.
A Support Agent With Too Much Reach
A SaaS support agent could inspect subscriptions and apply goodwill credits. A billing API timeout triggered a retry, but the credit API was not idempotent, so credits were applied twice. A global ticket index also returned another tenant's note because tenant scope lived in the prompt, not retrieval code. The fix was an explicit workflow engine, a tool broker issuing scoped credentials, idempotency keys, approval states for credits, tenant-filtered retrieval, and a kill switch for write tools only.
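The double-credit failure can be reproduced and fixed in a few lines. The sketch below is a minimal illustration, not the team's actual implementation: `CreditLedger`, `FlakyLedger`, and `credit_with_retry` are hypothetical names, and the ledger is an in-memory stand-in for a real billing API. It assumes the credit API accepts an idempotency key and returns the recorded result when it sees the same key again.

```python
import uuid


class CreditLedger:
    """Hypothetical in-memory stand-in for a credit API that honors idempotency keys."""

    def __init__(self):
        self.balances = {}  # account id -> credited cents
        self.results = {}   # idempotency key -> balance returned for that key

    def apply_credit(self, account_id, cents, idempotency_key):
        # A key seen before returns the recorded result instead of crediting again.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        balance = self.balances.get(account_id, 0) + cents
        self.balances[account_id] = balance
        self.results[idempotency_key] = balance
        return balance


class FlakyLedger(CreditLedger):
    """Simulates the incident: the first write succeeds but its response is lost."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def apply_credit(self, account_id, cents, idempotency_key):
        self.calls += 1
        balance = super().apply_credit(account_id, cents, idempotency_key)
        if self.calls == 1:
            raise TimeoutError("response lost after the write succeeded")
        return balance


def credit_with_retry(ledger, account_id, cents, attempts=3):
    # One key per logical action, generated once and reused on every retry.
    key = str(uuid.uuid4())
    for _ in range(attempts):
        try:
            return ledger.apply_credit(account_id, cents, key)
        except TimeoutError:
            continue
    raise TimeoutError("credit not confirmed after retries")
```

Calling `credit_with_retry(FlakyLedger(), "acct-1", 500)` hits the ledger twice but credits the account once. The design choice that matters is generating the key outside the retry loop: a key created per attempt would make every retry look like a new credit.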
The Agent as a Distributed System
Where Teams Misread the System
Demos optimize for the happy path
Kickoff demos use sandbox data, admin accounts, and three curated questions. They do not show:
- A user without permission triggering a refund tool.
- A stale index answering from last quarter's policy PDF.
- Two tool calls racing and leaving partial state.
- An auditor asking who approved an automated change.
Leadership approves budget based on the demo path. Engineering inherits every unhappy path without budget for workflow or controls.
Demos also hide latency and cost. A single curated question with warm caches does not show retrieval over a million chunks, five sequential tool calls, or retry storms when a downstream API flakes. Production users ask messy, ambiguous questions across long sessions. Workflow design must include timeouts, partial failure handling, and user-visible status, not only a spinning indicator.
Tools multiply blast radius
Connecting an agent to "read tickets" feels safe. Connecting it to "update account status" without tiered autonomy is not.
| Tool tier | Example | Production requirement |
|---|---|---|
| Read-only | Search docs | Scope indexes, log queries |
| Write-low | Create draft | Human review before send |
| Write-high | Transfer funds | Hard block or dual control |
Teams that give one super-token to the agent recreate shared admin credentials with better marketing.
Blast radius also applies to data exposure. A read tool that searches "all tickets" may return another customer's PII if indexes lack tenant filters. Write tools are obvious risks. Over-broad read tools are how agents become accidental data exfiltration channels with a friendly chat box on top.
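The tier table above can be expressed as a small policy function. This is a sketch under stated assumptions: the tool names, the `Tier` and `Decision` enums, and the `decide` function are all hypothetical, and write-high tools are hard-blocked here even though dual control would be an equally valid choice.

```python
from enum import Enum


class Tier(Enum):
    READ_ONLY = "read-only"
    WRITE_LOW = "write-low"
    WRITE_HIGH = "write-high"


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical tool registry mirroring the examples in the table.
TOOL_TIERS = {
    "search_docs": Tier.READ_ONLY,
    "create_draft": Tier.WRITE_LOW,
    "transfer_funds": Tier.WRITE_HIGH,
}


def decide(tool, role_tools):
    """Return a policy decision for one tool call made on behalf of one role."""
    tier = TOOL_TIERS.get(tool)
    # Unknown tools and tools outside the caller's role fail closed.
    if tier is None or tool not in role_tools:
        return Decision.DENY
    if tier is Tier.READ_ONLY:
        return Decision.ALLOW
    if tier is Tier.WRITE_LOW:
        return Decision.NEEDS_APPROVAL
    # Write-high: hard block in this sketch; dual control is the alternative.
    return Decision.DENY
```

The per-role tool set is what replaces the super-token: a support role that was never granted `transfer_funds` gets a denial before the tier is even consulted.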
No kill switch
When an agent misbehaves at 2 a.m., the first response cannot be "redeploy the prompt." You need to be able to disable tool X, route to a human, and freeze memory refresh, with named owners on call.
If that playbook does not exist, mean time to recovery is however long your longest deploy pipeline takes.
Kill switches should be granular. Disabling the entire agent during a billing tool incident also disables harmless FAQ answers. Disabling only the credit tool, routing billing questions to humans, and leaving read-only subscription lookup active is how ops teams stay calm at 2 a.m.
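A granular kill switch can be as simple as a per-tool flag checked before every tool call. The sketch below is illustrative only: `KillSwitches`, `route`, and the tool names are hypothetical, and the flags live in memory here, whereas a real deployment would presumably keep them in a shared store so that flipping one does not require a deploy.

```python
class KillSwitches:
    """Hypothetical per-tool kill switches; in-memory for illustration."""

    def __init__(self):
        self._disabled = set()

    def disable(self, tool):
        self._disabled.add(tool)

    def enable(self, tool):
        self._disabled.discard(tool)

    def is_enabled(self, tool):
        return tool not in self._disabled


def route(tool, switches):
    # A disabled tool hands the request to a human instead of failing the whole agent.
    return "run_tool" if switches.is_enabled(tool) else "route_to_human"
```

With `apply_credit` disabled, `route("apply_credit", switches)` returns `"route_to_human"` while `route("lookup_subscription", switches)` still returns `"run_tool"`, which is the behavior described above.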
Bad Shape, Better Shape
Bad: chat UI plus tools
User --> Chat UI --> Model --> any tool with service account
Failures:
- No audit trail of which tool fired for which user.
- No per-tenant or per-role tool list.
- No eval set from real workflows.
- Incidents debugged by reading raw prompts in logs.
Good: orchestrated agent system
User --> API gateway (auth) --> workflow engine --> model steps
                                |
                                +--> policy engine (allow/deny/approve)
                                +--> tool broker (scoped credentials)
                                +--> trace + eval hooks
Properties:
- Each request has identity, correlation id, and policy version.
- Tools run with least privilege, not platform admin.
- Human approval is a first-class state, not a Slack message.
- Eval jobs replay de-identified production traces after changes.
The good pattern does not require a specific vendor stack. It requires the same architectural habits you already use for customer-facing APIs: authentication at the edge, explicit contracts between components, and change management that includes regression checks before full rollout.
What the Pattern Teaches
Models got cheaper and faster. Reliability did not become automatic.
Production agents fail for the same reasons microservices failed when teams shipped services without tracing, idempotency, or ownership. The model is one component. System design is the product.
CTOs should ask the same questions they ask for a new payments API:
- Who can call it?
- What breaks if it is wrong?
- How do we detect regression?
- How do we shut it off without taking down the whole site?
Funding should follow the same bar. If you would not ship a payments endpoint without integration tests and an incident runbook, do not ship a funds-adjacent agent without the agent equivalent. The model is not a shortcut around platform engineering.
Worked Example: SaaS support agent
A B2B SaaS team shipped an agent that could "look up subscription status" and "apply credits."
| Design | Credit tool | Result |
|---|---|---|
| Demo | admin API key in agent | credits applied without ticket link |
| Production v1 | read-only subscription lookup | safe, limited value |
| Production v2 | credit tool with approval queue + audit | ops trusts automation |
The model did not change between v1 and v2. System design changed.
Between v1 and v2 the team also gained measurable confidence. Support leads could see which credits were proposed, which were approved, and which user identity triggered each action. That visibility turned the agent from a demo curiosity into something ops would route real tickets through.
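One way to make approval a first-class state is to model each credit as a proposal that moves through explicit transitions, recording who caused each one. The sketch below is an assumption about what such a record might look like, not the team's design: `CreditProposal`, the state names, and the actor identifiers are hypothetical.

```python
from dataclasses import dataclass, field

# Allowed transitions; anything not listed is rejected.
ALLOWED = {
    "proposed": {"approved", "rejected"},
    "approved": {"applied"},
    "rejected": set(),
    "applied": set(),
}


@dataclass
class CreditProposal:
    ticket_id: str
    requested_by: str
    cents: int
    state: str = "proposed"
    audit: list = field(default_factory=list)  # (new_state, actor) pairs

    def transition(self, new_state, actor):
        # Refuse shortcuts such as proposed -> applied without an approval.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot move from {self.state} to {new_state}")
        self.state = new_state
        self.audit.append((new_state, actor))
```

Because the agent can only create a proposal and the credit tool only accepts proposals in the `approved` state, the audit list answers the questions support leads asked: what was proposed, who approved it, and which identity triggered it.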
Where This Shows Up: fintech and SaaS
Fintech: agents that touch client data or money movement need segregation of duties and immutable audit logs. A fluent answer that triggers the wrong ledger adjustment is a Sev1, not a UX bug.
SaaS: multi-tenant agents must not leak context across tenants. The workflow engine should enforce the tenant ID on every retrieval and tool call rather than relying on the prompt to do it.
The pattern is identical: orchestration and policy before prompt tuning.
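Tenant scope enforced in retrieval code, rather than in the prompt, might look like the following. This is a minimal sketch: `Note` and `search_notes` are hypothetical, the index is a plain list, and the substring match stands in for whatever search the real system uses. The assumption that matters is that `tenant_id` comes from the authenticated request, never from model output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    tenant_id: str
    text: str


def search_notes(index, tenant_id, query):
    """Return matching notes for one tenant only."""
    if not tenant_id:
        # No tenant on the request means no results, not a global search.
        raise PermissionError("tenant_id is required")
    # Scope first, then match: other tenants' notes never reach the matcher.
    scoped = [note for note in index if note.tenant_id == tenant_id]
    return [note for note in scoped if query.lower() in note.text.lower()]
```

A query for "refund" on behalf of tenant A cannot return tenant B's note, regardless of what the prompt or the model says, because the filter runs before the match.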
Where Control Belongs in a Production Agent
Demo agent:
User --> Chat UI --> Model --> tools with shared token
Production agent:
User --> Auth gateway --> Workflow engine
                          |--> Policy engine
                          |--> Context plane
                          |--> Tool broker with scoped credentials
                          \--> Traces, evals, rollback controls
The Autonomy Tradeoff
The tradeoff is autonomy versus reversibility. The more irreversible the action, the more the architecture should resemble a controlled workflow and the less it should resemble an open-ended conversation.
The useful review for a production agent is not a generic architecture checklist. It should inspect permissions, context, tool behavior, eval evidence, and rollback. If any of those is missing, the team may still be busy, but leadership does not yet have an artifact it can base a decision on.
What Leaders Should Inspect
For AI systems, the release review should inspect the path taken before the model produces text. Start with identity. Which user, tenant, role, policy version, and task type entered the workflow? Which sources became eligible because of that identity? Which sources were rejected? If the team cannot answer from logs, the system is not auditable enough for production.
Next, review context freshness. Static embeddings, vector indexes, document stores, CRM fields, and tool results all need owners and a maximum age. A model answering from a stale policy is not hallucinating in the usual sense. It is faithfully using bad evidence. Critical sources should fail closed when freshness checks fail. Less critical sources can degrade with labels, but the degradation should be deliberate and visible.
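The fail-closed versus labeled-degradation split can be sketched as a single check run before a source is allowed into context. The names below (`check_freshness`, `StaleSourceError`, the source names) are hypothetical, and the maximum ages are whatever each source owner sets.

```python
from datetime import datetime, timedelta, timezone


class StaleSourceError(Exception):
    """Raised when a critical source is older than its allowed maximum age."""


def check_freshness(name, last_refreshed, max_age, critical, now=None):
    """Return "fresh" or "stale"; raise for stale critical sources."""
    now = now or datetime.now(timezone.utc)
    if now - last_refreshed <= max_age:
        return "fresh"
    if critical:
        # Critical sources fail closed: the workflow must not answer from them.
        raise StaleSourceError(f"{name} exceeded max age {max_age}")
    # Non-critical sources degrade visibly; the caller labels the answer.
    return "stale"
```

A refund policy index marked critical stops the workflow when it is a month old against a seven-day limit, while an FAQ index in the same condition returns `"stale"` so the answer can carry a visible label.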
Then review tools by blast radius. Read-only tools still leak data if scope is broad. Draft tools create review burden. Write tools change state and need idempotency, approval thresholds, scoped credentials, and rollback behavior. A shared service token is an architectural smell because it erases the user on whose behalf the action happened.
Finally, inspect evals and incident controls. Evals should replay de-identified production-shaped traces and score source selection, permission compliance, tool choice, and outcome. Kill switches should be granular: disable a credit tool, freeze memory refresh, force human approval, or route one intent class away from automation without taking down harmless read-only use cases.
Bad Paths to Test
AI release tests should include permission denial, empty retrieval, stale retrieval, wrong-tenant retrieval, tool timeout, duplicate tool retry, malformed tool output, long conversation memory, and user attempts to override policy. Test cost and latency under realistic tool chains, not only one warm prompt. Test whether the answer cites the source that actually justified the action.
For agentic workflows, include partial completion. The model may draft a response after a tool failed, or execute one side effect before another tool times out. The workflow must know whether to compensate, ask for approval, fail closed, or resume from a durable state.
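Compensation after partial completion can be sketched as a list of steps, each paired with an undo action. This is one possible shape, assuming every side effect has a compensating action; `run_workflow` and the step names are hypothetical, and the log is an in-memory list where a real workflow engine would persist it durably.

```python
def run_workflow(steps):
    """Run (name, action, compensate) steps; undo completed ones on failure."""
    log = []
    undo_stack = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Undo completed side effects in reverse order.
            while undo_stack:
                undo_name, undo = undo_stack.pop()
                undo()
                log.append(f"compensated:{undo_name}")
            return "compensated", log
        log.append(f"done:{name}")
        undo_stack.append((name, compensate))
    return "completed", log
```

If a credit step succeeds and a later notification step times out, the credit is reversed and the log records all three events, so a test can assert on the final state rather than on the model's reply. Steps that cannot be undone are the ones that belong behind approval instead.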
Paths Worth Comparing for Production Agents
Keep early agents read-only when trust is low. Let agents draft actions for human approval before granting autonomy. Block or dual-control irreversible actions such as funds movement, account closure, claims approval, or high-value credits.
These paths are not rungs on a ladder. Each one carries a different mix of risk, cost, and learning. The weak choice is the one that hides the tradeoff until users, operators, or auditors discover it for you.
The Rule for Production Agents
A production agent is a distributed system with a model inside it. Treat it as a chat feature and the first serious incident will be about identity, tools, retries, memory, or rollback. The practical lesson is to demand evidence specific to the agent under review, not a universal checklist. The artifact should expose permissions, context, tool behavior, eval evidence, and rollback clearly enough for another team to challenge the decision.
If your team is deciding how to design a production agent, use the Codebase Context Scan to pressure-test the boundary before it hardens.
