Agentic AI · Fintech

Why Most AI Agents Fail in Production: The Missing Layer Is System Design

Bala Velayutham

19 MAY 2025

A Support Agent With Too Much Reach

A SaaS support agent could inspect subscriptions and apply goodwill credits. A billing API timeout triggered a retry, but the credit API was not idempotent, so credits were applied twice. A global ticket index also returned another tenant's note because tenant scope lived in the prompt, not retrieval code. The fix was an explicit workflow engine, a tool broker issuing scoped credentials, idempotency keys, approval states for credits, tenant-filtered retrieval, and a kill switch for write tools only.
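The double-credit failure is the easiest one to picture in code. The sketch below is a minimal in-memory illustration, assuming a hypothetical `CreditLedger` class and an idempotency key derived from the ticket and action; none of these names come from the team's actual system.

```python
from dataclasses import dataclass, field


@dataclass
class CreditLedger:
    """In-memory stand-in for a billing service (hypothetical)."""
    balances: dict = field(default_factory=dict)
    seen_keys: dict = field(default_factory=dict)

    def apply_credit(self, account_id: str, amount_cents: int, idempotency_key: str) -> dict:
        # A replayed key returns the stored result instead of crediting again.
        if idempotency_key in self.seen_keys:
            return self.seen_keys[idempotency_key]
        self.balances[account_id] = self.balances.get(account_id, 0) + amount_cents
        result = {"account_id": account_id, "balance_cents": self.balances[account_id]}
        self.seen_keys[idempotency_key] = result
        return result


ledger = CreditLedger()
# Assumption: the key is built from the ticket and the action, so a retry reuses it.
key = "ticket-123:goodwill-credit"
first = ledger.apply_credit("acct-1", 500, key)
retry = ledger.apply_credit("acct-1", 500, key)  # simulated retry after a timeout
```

The retry returns the first result and the balance moves once. The key has to be chosen by the workflow, not generated fresh per attempt, or the retry looks like a new request.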

The Agent as a Distributed System

Where Teams Misread the System

Demos optimize for the happy path

Kickoff demos use sandbox data, admin accounts, and three curated questions. They do not show:

A user without permission triggering a refund tool.
A stale index answering from last quarter's policy PDF.
Two tool calls racing and leaving partial state.
An auditor asking who approved an automated change.

Leadership approves budget based on the demo path. Engineering inherits every unhappy path without budget for workflow or controls.

Demos also hide latency and cost. A single curated question with warm caches does not show retrieval over a million chunks, five sequential tool calls, or retry storms when a downstream API flakes. Production users ask messy, ambiguous questions across long sessions. Workflow design must include timeouts, partial failure handling, and user-visible status, not only a spinning indicator.
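As a rough sketch of what "timeouts and user-visible status" can mean, the helper below wraps a tool call in a deadline and returns a status the UI could display. The function and tool names are hypothetical, and the thread-based deadline is only one way to do this.

```python
import concurrent.futures
import time


def call_tool_with_timeout(tool, args, timeout_s):
    """Run a tool call with a deadline and return a status the UI can show."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(tool, **args)
    try:
        result = future.result(timeout=timeout_s)
        return {"status": "ok", "result": result}
    except concurrent.futures.TimeoutError:
        # The call may still finish downstream; the caller must not assume it failed.
        return {"status": "timed_out", "result": None}
    except Exception as exc:
        return {"status": "failed", "result": None, "error": str(exc)}
    finally:
        # Do not block the user on a worker that is still running.
        pool.shutdown(wait=False)


def slow_billing_lookup(account_id):
    time.sleep(0.3)  # simulates a flaky downstream API
    return {"account_id": account_id, "plan": "pro"}


def fast_billing_lookup(account_id):
    return {"account_id": account_id, "plan": "pro"}


timed_out = call_tool_with_timeout(slow_billing_lookup, {"account_id": "acct-1"}, timeout_s=0.05)
ok = call_tool_with_timeout(fast_billing_lookup, {"account_id": "acct-1"}, timeout_s=1.0)
```

A timed-out call is not the same as a failed call. The downstream service may have completed the work, which is why a retry policy for write tools needs more than a deadline.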

Tools multiply blast radius

Connecting an agent to "read tickets" feels safe. Connecting it to "update account status" without tiered autonomy is not.

Tool tier	Example	Production requirement
Read-only	Search docs	Scope indexes, log queries
Write-low	Create draft	Human review before send
Write-high	Transfer funds	Hard block or dual control

Teams that give one super-token to the agent recreate shared admin credentials with better marketing.
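The tier table can be expressed as a small policy function. This is a sketch under stated assumptions: the tool names mirror the table's examples, the per-role tool set is hypothetical, and write-high is modeled as a hard block rather than dual control.

```python
from enum import Enum


class Tier(Enum):
    READ_ONLY = "read-only"
    WRITE_LOW = "write-low"
    WRITE_HIGH = "write-high"


class Decision(Enum):
    ALLOW = "allow"
    APPROVE = "approve"  # pause for human review
    DENY = "deny"


# Tool names follow the table's examples; the mapping itself is illustrative.
TOOL_TIERS = {
    "search_docs": Tier.READ_ONLY,
    "create_draft": Tier.WRITE_LOW,
    "transfer_funds": Tier.WRITE_HIGH,
}


def decide(tool_name: str, role_tools: set) -> Decision:
    """Return allow, approve, or deny for one tool call."""
    tier = TOOL_TIERS.get(tool_name)
    # Unknown tools and tools outside the caller's role are denied outright.
    if tier is None or tool_name not in role_tools:
        return Decision.DENY
    if tier is Tier.READ_ONLY:
        return Decision.ALLOW
    if tier is Tier.WRITE_LOW:
        return Decision.APPROVE
    return Decision.DENY  # write-high: hard block in this sketch


support_role = {"search_docs", "create_draft", "transfer_funds"}
read_decision = decide("search_docs", support_role)
draft_decision = decide("create_draft", support_role)
funds_decision = decide("transfer_funds", support_role)
unknown_decision = decide("delete_account", support_role)
```

The default is deny. A tool that nobody classified never reaches the model's tool list.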

Blast radius also applies to data exposure. A read tool that searches "all tickets" may return another customer's PII if indexes lack tenant filters. Write tools are obvious risks. Over-broad read tools are how agents become accidental data exfiltration channels with a friendly chat box on top.
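One way to keep a read tool from crossing tenants is to make tenant scope a required argument that the retrieval code applies before any matching happens. The sketch below uses a hypothetical in-memory index and substring matching in place of a real search backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    tenant_id: str
    text: str


# Hypothetical index contents for illustration only.
INDEX = [
    Ticket("tenant-a", "Refund requested for invoice 17"),
    Ticket("tenant-b", "Refund dispute, contact jane@example.com"),
]


def search_tickets(query: str, tenant_id: str) -> list:
    """Tenant scope is applied in code, before matching, not in the prompt."""
    if not tenant_id:
        raise ValueError("tenant_id is required")
    scoped = [t for t in INDEX if t.tenant_id == tenant_id]
    return [t for t in scoped if query.lower() in t.text.lower()]


hits = search_tickets("refund", tenant_id="tenant-a")
```

The model never sees the other tenant's note because it never enters the candidate set, whatever the prompt says.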

No kill switch

When an agent misbehaves at 2 a.m., the first response cannot be "redeploy the prompt." You need the ability to disable tool X, route to a human, and freeze memory refresh, with owners on call.

If that playbook does not exist, mean time to recovery is however long your longest deploy pipeline takes.

Kill switches should be granular. Disabling the entire agent during a billing tool incident also disables harmless FAQ answers. Disabling only the credit tool, routing billing questions to humans, and leaving read-only subscription lookup active is how ops teams stay calm at 2 a.m.
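A granular kill switch can be as simple as a per-tool flag that the workflow checks before every call. The class and tool names below are hypothetical, and a real deployment would back the flags with a shared config store rather than process memory.

```python
class ToolSwitchboard:
    """Per-tool on/off flags checked before every tool call (illustrative)."""

    def __init__(self, tools):
        self.enabled = {name: True for name in tools}

    def disable(self, tool):
        if tool not in self.enabled:
            raise KeyError(tool)
        self.enabled[tool] = False

    def route(self, tool):
        # Disabled or unknown tools go to a human instead of the agent.
        return "agent" if self.enabled.get(tool, False) else "human"


switchboard = ToolSwitchboard(["faq_answer", "subscription_lookup", "apply_credit"])
switchboard.disable("apply_credit")  # billing incident: turn off only the write tool

credit_route = switchboard.route("apply_credit")
lookup_route = switchboard.route("subscription_lookup")
faq_route = switchboard.route("faq_answer")
```

Flipping the flag requires no deploy, which is the point: recovery time is bounded by how fast someone can change a setting, not by the pipeline.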

Bad Shape, Better Shape

Bad: chat UI plus tools

User --> Chat UI --> Model --> any tool with service account

Failures:

No audit trail of which tool fired for which user.
No per-tenant or per-role tool list.
No eval set from real workflows.
Incidents debugged by reading raw prompts in logs.

Good: orchestrated agent system

User --> API gateway (auth) --> workflow engine --> model steps
                                      |
                                      +--> policy engine (allow/deny/approve)
                                      +--> tool broker (scoped credentials)
                                      +--> trace + eval hooks

Properties:

Each request has identity, correlation id, and policy version.
Tools run with least privilege, not platform admin.
Human approval is a first-class state, not a Slack message.
Eval jobs replay de-identified production traces after changes.
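To make the first two properties concrete, here is a minimal sketch of a request context and a tool broker that issues scoped credentials and logs every decision. All class names, roles, and scope strings are assumptions for illustration, not a specific vendor's API.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RequestContext:
    user_id: str
    tenant_id: str
    role: str
    policy_version: str
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass(frozen=True)
class ScopedCredential:
    tool_name: str
    tenant_id: str
    on_behalf_of: str
    scopes: tuple


# Hypothetical role-to-tool mapping; a real system would load this from policy.
ROLE_SCOPES = {
    "support_agent": {"subscription_lookup": ("subscriptions:read",)},
}


class ToolBroker:
    def __init__(self):
        self.audit_log = []

    def issue(self, ctx, tool_name):
        scopes = ROLE_SCOPES.get(ctx.role, {}).get(tool_name)
        # Every request is logged, granted or not, with the caller's identity.
        self.audit_log.append({
            "correlation_id": ctx.correlation_id,
            "user_id": ctx.user_id,
            "tenant_id": ctx.tenant_id,
            "policy_version": ctx.policy_version,
            "tool": tool_name,
            "granted": scopes is not None,
        })
        if scopes is None:
            raise PermissionError(f"{ctx.role} may not call {tool_name}")
        return ScopedCredential(tool_name, ctx.tenant_id, ctx.user_id, scopes)


ctx = RequestContext("user-7", "tenant-a", "support_agent", "policy-v3")
broker = ToolBroker()
cred = broker.issue(ctx, "subscription_lookup")
try:
    broker.issue(ctx, "apply_credit")
    denied = False
except PermissionError:
    denied = True
```

The credential carries the user and tenant it was issued for, so a downstream service can tell on whose behalf the call was made. The denied request still leaves an audit record.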

The good pattern does not require a specific vendor stack. It requires the same architectural habits you already use for customer-facing APIs: authentication at the edge, explicit contracts between components, and change management that includes regression checks before full rollout.

What the Pattern Teaches

Models got cheaper and faster. Reliability did not become automatic.

Production agents fail for the same reasons microservices failed when teams shipped services without tracing, idempotency, or ownership. The model is one component. System design is the product.

CTOs should ask the same questions they ask for a new payments API:

Who can call it?
What breaks if it is wrong?
How do we detect regression?
How do we shut it off without taking down the whole site?

Funding should follow the same bar. If you would not ship a payments endpoint without integration tests and an incident runbook, do not ship a funds-adjacent agent without the agent equivalent. The model is not a shortcut around platform engineering.

Worked Example: SaaS support agent

A B2B SaaS team shipped an agent that could "look up subscription status" and "apply credits."

Design	Credit tool	Result
Demo	admin API key in agent	credits applied without ticket link
Production v1	read-only subscription lookup	safe, limited value
Production v2	credit tool with approval queue + audit	ops trusts automation

The model did not change between v1 and v2. System design changed.

Between v1 and v2 the team also gained measurable confidence. Support leads could see which credits were proposed, which were approved, and which user identity triggered each action. That visibility turned the agent from a demo curiosity into something ops would route real tickets through.
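The v2 approval queue can be pictured as a small state machine in which a credit cannot reach "applied" without passing through "approved". The states, field names, and actors below are assumptions for illustration, not the team's actual schema.

```python
from dataclasses import dataclass, field

# Allowed moves between states; anything else is rejected.
ALLOWED_TRANSITIONS = {
    "proposed": {"approved", "rejected"},
    "approved": {"applied"},
    "rejected": set(),
    "applied": set(),
}


@dataclass
class CreditRequest:
    ticket_id: str
    amount_cents: int
    proposed_for_user: str
    state: str = "proposed"
    audit: list = field(default_factory=list)

    def transition(self, new_state, actor):
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"cannot move from {self.state} to {new_state}")
        # Record who moved the request, and from which state to which.
        self.audit.append({"from": self.state, "to": new_state, "actor": actor})
        self.state = new_state


request = CreditRequest("ticket-123", 500, proposed_for_user="user-7")
try:
    request.transition("applied", actor="agent")  # the agent tries to skip approval
    skipped_approval = True
except ValueError:
    skipped_approval = False

request.transition("approved", actor="support-lead-2")
request.transition("applied", actor="workflow-engine")
```

The audit list is what gives support leads the visibility described above: each credit shows who proposed it, who approved it, and which ticket it belongs to.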

Where This Shows Up: fintech and SaaS

Fintech: agents that touch client data or money movement need segregation of duties and immutable audit logs. A fluent answer that triggers the wrong ledger adjustment is a Sev1, not a UX bug.

SaaS: multi-tenant agents must not leak context across tenants. The workflow engine should enforce the tenant id on every retrieval and tool call in code, rather than hoping the prompt holds.

The pattern is identical: orchestration and policy before prompt tuning.

Where Control Belongs in Production Agent System Design

Demo agent:
User --> Chat UI --> Model --> tools with shared token

Production agent:
User --> Auth gateway --> Workflow engine
                         |--> Policy engine
                         |--> Context plane
                         |--> Tool broker with scoped credentials
                         \--> Traces, evals, rollback controls

The Autonomy Tradeoff

The tradeoff is autonomy versus reversibility. The more irreversible the action, the more the architecture should resemble a controlled workflow and the less it should resemble an open-ended conversation.

For production agent system design, the useful review is not a generic architecture checklist. It should inspect permission, context, tool behavior, eval evidence, and rollback. If those fields are missing, the team may still be busy, but leadership does not yet have a decision-quality artifact.

What Leaders Should Inspect

For AI systems, the release review should inspect the path taken before the model produces text. Start with identity. Which user, tenant, role, policy version, and task type entered the workflow? Which sources became eligible because of that identity? Which sources were rejected? If the team cannot answer from logs, the system is not auditable enough for production.

Next, review context freshness. Static embeddings, vector indexes, document stores, CRM fields, and tool results all need owners and a maximum age. A model answering from a stale policy is not hallucinating in the usual sense. It is faithfully using bad evidence. Critical sources should fail closed when freshness checks fail. Less critical sources can degrade with labels, but the degradation should be deliberate and visible.
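A freshness check along these lines might look like the sketch below. The source names, maximum ages, and the choice of which sources fail closed are all assumptions picked for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-source limits; each source would have a named owner.
MAX_AGE = {
    "refund_policy": timedelta(days=1),
    "faq": timedelta(days=30),
}
FAIL_CLOSED = {"refund_policy"}  # critical sources are blocked when stale


def check_freshness(source, last_refreshed, now):
    """Return 'fresh', 'blocked', or 'stale_labeled' for one source."""
    if now - last_refreshed <= MAX_AGE[source]:
        return "fresh"
    return "blocked" if source in FAIL_CLOSED else "stale_labeled"


now = datetime(2025, 5, 19, tzinfo=timezone.utc)
policy_status = check_freshness("refund_policy", now - timedelta(days=3), now)
faq_status = check_freshness("faq", now - timedelta(days=45), now)
recent_status = check_freshness("refund_policy", now - timedelta(hours=2), now)
```

A blocked source is removed from the eligible set before retrieval. A stale-labeled source stays eligible, but the answer carries the label so the degradation is visible to the user.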

Then review tools by blast radius. Read-only tools still leak data if scope is broad. Draft tools create review burden. Write tools change state and need idempotency, approval thresholds, scoped credentials, and rollback behavior. A shared service token is an architectural smell because it erases the user on whose behalf the action happened.

Finally, inspect evals and incident controls. Evals should replay de-identified production-shaped traces and score source selection, permission compliance, tool choice, and outcome. Kill switches should be granular: disable a credit tool, freeze memory refresh, force human approval, or route one intent class away from automation without taking down harmless read-only use cases.
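The four eval dimensions can be scored per trace and rolled into a release gate. The trace fields and tool names in this sketch are hypothetical; real traces would carry more detail and would be de-identified before replay.

```python
def score_trace(expected, actual):
    """Score one replayed trace on the four dimensions named above."""
    return {
        "source_selection": set(actual["sources"]) <= set(expected["eligible_sources"]),
        "permission_compliance": actual["tool"] in expected["allowed_tools"],
        "tool_choice": actual["tool"] == expected["tool"],
        "outcome": actual["outcome"] == expected["outcome"],
    }


def gate(scored_traces):
    """Fail the release if any check on any trace failed."""
    failures = [
        (index, name)
        for index, checks in enumerate(scored_traces)
        for name, passed in checks.items()
        if not passed
    ]
    return {"passed": not failures, "failures": failures}


expected = {
    "eligible_sources": ["kb-12"],
    "allowed_tools": ["subscription_lookup"],
    "tool": "subscription_lookup",
    "outcome": "answered",
}
good_run = {"sources": ["kb-12"], "tool": "subscription_lookup", "outcome": "answered"}
bad_run = {"sources": ["kb-12", "kb-99"], "tool": "apply_credit", "outcome": "credit_applied"}

report = gate([score_trace(expected, good_run), score_trace(expected, bad_run)])
```

Scoring permission compliance separately from tool choice matters. A run can pick a plausible tool and still be a release blocker because the caller was never allowed to use it.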

Bad Paths to Test

AI release tests should include permission denial, empty retrieval, stale retrieval, wrong-tenant retrieval, tool timeout, duplicate tool retry, malformed tool output, long conversation memory, and user attempts to override policy. Test cost and latency under realistic tool chains, not only one warm prompt. Test whether the answer cites the source that actually justified the action.

For agentic workflows, include partial completion. The model may draft a response after a tool failed, or execute one side effect before another tool times out. The workflow must know whether to compensate, ask for approval, fail closed, or resume from a durable state.
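Of the four options, compensation is the simplest to sketch: each step carries an undo action, and a failure unwinds the completed steps in reverse order. The step names and the in-memory state below are hypothetical, and a real workflow would persist progress durably rather than in a local list.

```python
def run_workflow(steps):
    """Run (name, action, compensate) steps; undo completed ones on failure."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            # Unwind in reverse order so later side effects are undone first.
            for _, undo in reversed(completed):
                undo()
            return {
                "status": "compensated",
                "failed_step": name,
                "undone": [done_name for done_name, _ in reversed(completed)],
            }
        completed.append((name, compensate))
    return {"status": "completed", "failed_step": None, "undone": []}


state = {"credit_cents": 0}


def apply_credit():
    state["credit_cents"] += 500


def reverse_credit():
    state["credit_cents"] -= 500


def send_confirmation():
    raise TimeoutError("notification service timed out")


def no_op():
    pass


outcome = run_workflow([
    ("apply_credit", apply_credit, reverse_credit),
    ("send_confirmation", send_confirmation, no_op),
])
```

Compensation is not always the right answer. For some failures the better choice is to keep the credit and retry the notification, which is why the workflow, not the model, should own that decision.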

Paths Worth Comparing for Production Agent System Design

Keep early agents read-only when trust is low. Let agents draft actions for human approval before granting autonomy. Block or dual-control irreversible actions such as funds movement, account closure, claims approval, or high-value credits.

In production agent system design, the alternative paths are not steps on a ladder. Each one carries a different mix of risk, cost, and learning. The weak choice is the one that hides the tradeoff until users, operators, or auditors discover it for you.

The Rule for Production Agents

A production agent is a distributed system with a model inside it. Treat it as a chat feature and the first serious incident will be about identity, tools, retries, memory, or rollback. The practical lesson is to demand evidence that fits production agent system design, not a universal checklist. The artifact should expose permission, context, tool behavior, eval evidence, and rollback clearly enough for another team to challenge the decision.

If production agent system design is the decision in front of your team, use the Codebase Context Scan to pressure-test the boundary before it hardens.
