Most agentic AI demos look impressive. Production deployments are a different problem entirely. Agents that browse the web, write code, call APIs, and coordinate with other agents are powerful — and they fail in ways that are hard to anticipate and harder to debug.
This post covers the architecture decisions that separate brittle prototypes from systems you can actually operate.
What makes agentic systems different
A traditional API call is deterministic. You send a payload, you get a response, you handle errors. An agent loop is a feedback system: an LLM decides what tool to call, gets a result, decides what to do next, and so on for N steps until it reaches a goal or hits a limit.
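That feedback loop can be sketched in a few lines. This is an illustrative skeleton only — `call_llm` and `run_tool` are hypothetical stand-ins you would wire to your actual model client and tool layer:

```python
def run_agent(task, call_llm, run_tool, max_steps=12):
    """Minimal agent loop: the LLM picks an action, we execute it, feed the
    observation back, and repeat until it finishes or hits the step cap.

    call_llm(history) -> {"action": "finish", "result": ...}
                       or {"action": "tool", "tool": name, "args": {...}}
    run_tool(name, args) -> observation string
    """
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_llm(history)
        if decision["action"] == "finish":
            return decision["result"]
        observation = run_tool(decision["tool"], decision["args"])
        history.append({"role": "tool", "content": observation})
    # A hard step cap is the simplest defense against the loop failures below.
    raise RuntimeError(f"agent hit the {max_steps}-step limit without finishing")
```

Everything that follows in this post is about what goes wrong inside and around this loop.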
The failure modes are different:
- Hallucinated tool calls — the model invents arguments or calls tools that don't exist.
- Goal drift — the agent pursues a proxy objective instead of the real one.
- Infinite loops — a malformed observation sends the agent into a cycle it never exits.
- Context window exhaustion — long conversations fill the context and degrade reasoning quality.
- Cascading failures — one bad tool call produces a bad observation that corrupts all downstream reasoning.
None of these show up reliably in a five-minute demo. They show up after you've run a thousand tasks in production.
Design the task boundary first
Before you write any agent code, define the task boundary precisely:
- Entry condition — what triggers this agent, and what data is it given at start?
- Exit condition — what constitutes success? What constitutes a terminal failure?
- Side-effect budget — which external systems can it write to, and what are the rollback paths?
- Escalation path — when does the agent hand off to a human?
Agents that lack clear exit conditions are the most common source of runaway costs. Cap token budgets, step counts, and wall-clock time independently.
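Independent caps can be enforced with a small budget object checked on every step. A minimal sketch — the class name and limits are illustrative, not from any particular framework:

```python
import time

class BudgetExceeded(RuntimeError):
    pass

class TaskBudget:
    """Tracks token, step, and wall-clock limits independently.

    Any single limit being exceeded aborts the task — a runaway agent
    can blow one budget while staying comfortably inside the others.
    """
    def __init__(self, max_tokens, max_steps, max_seconds):
        self.max_tokens = max_tokens
        self.max_steps = max_steps
        self.deadline = time.monotonic() + max_seconds
        self.tokens = 0
        self.steps = 0

    def charge(self, tokens):
        """Call once per agent step with that step's token usage."""
        self.tokens += tokens
        self.steps += 1
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {self.tokens}/{self.max_tokens}")
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step budget exceeded: {self.steps}/{self.max_steps}")
        if time.monotonic() > self.deadline:
            raise BudgetExceeded("wall-clock budget exceeded")
```

Raising an exception (rather than logging a warning) makes the cap a hard stop, which matters again in the safety section below.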
Tool design: the hidden bottleneck
The model's reasoning is only as good as the tool interfaces it calls. Poorly designed tools are the single biggest cause of agentic failures in real deployments.
Idempotency is non-negotiable. If your agent retries a failed step (and it will), a non-idempotent tool can double-charge a customer, send a duplicate email, or create two records. Every tool the agent calls should be safe to call twice with the same arguments.
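One common way to get idempotency is to key each call on a hash of its arguments and replay the stored result on retry. A sketch, assuming an in-memory cache — production systems would back this with a durable store:

```python
import hashlib
import json

_results = {}  # illustrative: a real deployment needs a durable store, not a dict

def idempotent(tool):
    """Return the cached result when a tool is retried with identical arguments."""
    def wrapper(**kwargs):
        key = tool.__name__ + ":" + hashlib.sha256(
            json.dumps(kwargs, sort_keys=True).encode()).hexdigest()
        if key not in _results:
            _results[key] = tool(**kwargs)
        return _results[key]
    return wrapper

@idempotent
def create_record(order_id):
    # Imagine this writes to a database; the write must happen once per order_id,
    # no matter how many times the agent retries the step.
    create_record.calls += 1
    return {"record_id": f"rec_{order_id}"}

create_record.calls = 0  # call counter, for illustration only
```

The idempotency key lives on the tool side, so the agent doesn't need to know or care whether a call is a retry.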
Return structured, bounded results. Don't return raw HTML from a web scrape when a structured summary is what the agent needs. Large, noisy tool responses overwhelm the context window and degrade reasoning accuracy.
Fail loudly with context. Error messages like "500 Internal Server Error" tell the agent nothing. Error messages like "Order 7294 was not found in the staging database. Did you mean to query production?" give it a path forward.
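In code, that means tool errors are structured results with a hint the agent can act on, not bare status codes. A hypothetical lookup tool (the field names and data are illustrative):

```python
def get_order(order_id, env="staging"):
    """Return a structured result; on failure, include an actionable hint."""
    # Illustrative stand-in for a real orders service.
    orders = {"7294": {"env": "production", "total": 129.00}}
    order = orders.get(order_id)
    if order is None:
        return {"ok": False,
                "error": f"Order {order_id} was not found in the {env} database.",
                "hint": "Verify the order ID, or query the production environment."}
    if order["env"] != env:
        return {"ok": False,
                "error": f"Order {order_id} was not found in the {env} database.",
                "hint": "Did you mean to query production?"}
    return {"ok": True, "order": order}
```

The `hint` field is what turns a dead end into a next step the model can take.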
A minimal tool schema looks like this:
```json
{
  "name": "create_invoice",
  "description": "Creates a draft invoice and returns the invoice ID. Idempotent: calling twice with the same order_id returns the existing draft.",
  "parameters": {
    "order_id": { "type": "string", "description": "The confirmed order ID from the orders service." },
    "line_items": { "type": "array", "description": "Array of { sku, quantity, unit_price } objects." }
  },
  "returns": {
    "invoice_id": "string",
    "status": "draft | submitted | error",
    "error_detail": "string | null"
  }
}
```
Multi-agent orchestration: keep the graph flat
Multi-agent systems introduce coordination overhead. The temptation is to build deep hierarchies: an orchestrator spawns sub-orchestrators which spawn workers. Resist this.
Flat is easier to observe. A single orchestrator with direct connections to specialist agents is far easier to trace than a tree structure three levels deep.
Message formats matter. Agents that communicate via structured JSON with explicit schemas degrade gracefully. Agents that pass free-form prose between each other accumulate ambiguity with every hop.
Version your agent contracts. When you upgrade the coding agent, the QA agent should not break silently. Treat inter-agent message schemas like API contracts: version them, test them, and deprecate old versions explicitly.
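A lightweight version of that contract check is to validate the schema version and required fields on every message at the orchestrator boundary. A sketch — the version set and field names are assumptions, not a standard:

```python
SUPPORTED_VERSIONS = {"1.0", "1.1"}  # illustrative: versions this orchestrator accepts

def validate_message(msg):
    """Reject inter-agent messages with unknown versions or missing fields.

    Failing loudly here is the point: a silently-accepted v0.9 message from
    an upgraded agent is exactly the breakage this check exists to catch.
    """
    version = msg.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(
            f"unsupported schema_version {version!r}; "
            f"supported: {sorted(SUPPORTED_VERSIONS)}")
    required = {"sender", "task_id", "payload"}
    missing = required - msg.keys()
    if missing:
        raise ValueError(f"message missing required fields: {sorted(missing)}")
    return msg
```

Deprecating a version then means removing it from the supported set in one place, with a clear error for any agent still emitting it.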
A minimal multi-agent topology for a software delivery workflow:
```text
Orchestrator
├── PlanningAgent (breaks task into subtasks, returns task list)
├── CodingAgent   (writes code given a spec, returns file diffs)
├── ReviewAgent   (reviews diffs for correctness and security)
└── DeployAgent   (applies approved diffs to staging or production)
```
Each agent knows only its own tools. The orchestrator holds state and routes results. No agent calls another directly.
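The flat topology reduces the orchestrator to a short, linear routing function. A sketch, with the agents passed in as plain callables — the role names mirror the diagram, the implementations are stubs you would replace:

```python
def orchestrate(task, agents):
    """Flat routing: the orchestrator holds all state and passes results
    between specialist agents. No agent ever calls another directly."""
    state = {"task": task}
    state["plan"] = agents["planner"](task)                      # list of subtasks
    state["diffs"] = [agents["coder"](sub) for sub in state["plan"]]
    state["review"] = agents["reviewer"](state["diffs"])
    if state["review"]["approved"]:
        state["deployed"] = agents["deployer"](state["diffs"])
    return state  # the full trace of the run lives in one place
```

Because every hop goes through this function, tracing a failed run means reading one `state` dict, not reconstructing a conversation across a tree of sub-orchestrators.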
Observability: you must be able to see what happened
Production agents need the same observability stack as any distributed system — arguably more, because the failure modes are harder to reason about.
Log every LLM call with full context. Store the model version, the system prompt hash, the full message history at the time of the call, the raw response, and the latency. When something goes wrong at step 7 of a 12-step chain, you need to reconstruct exactly what the model saw.
Trace tool calls with span IDs. Each tool invocation should emit a span with: tool name, arguments (sanitized of PII), result size, latency, and success/failure. This is standard distributed tracing — just apply it to agent tool calls.
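Wrapping tools in a span-emitting decorator is one way to get this without touching the agent loop. A sketch that appends spans to a list — in production you would emit to a real tracing backend such as OpenTelemetry; the field names here are illustrative:

```python
import time
import uuid

SPANS = []  # stand-in for a real tracing backend

def traced(tool):
    """Emit a span per tool invocation: name, latency, result size, outcome.

    Raw arguments are deliberately not recorded here — sanitize PII before
    logging them in a real system.
    """
    def wrapper(**kwargs):
        span = {"span_id": uuid.uuid4().hex[:8], "tool": tool.__name__}
        start = time.monotonic()
        try:
            result = tool(**kwargs)
            span["success"] = True
            span["result_size"] = len(str(result))
            return result
        except Exception:
            span["success"] = False
            raise
        finally:
            span["latency_ms"] = round((time.monotonic() - start) * 1000, 2)
            SPANS.append(span)
    return wrapper

@traced
def fetch_invoice(invoice_id):
    # Illustrative tool body.
    return {"invoice_id": invoice_id, "status": "draft"}
```

The `finally` block means a span is recorded whether the tool succeeds or raises, which is exactly when you need the trace most.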
Emit task-level metrics. Track success rate, mean step count, P95 latency, and token cost per task type. Sudden changes in step count often indicate a prompt regression before it appears in success rates.
A minimal observability schema for a task run:
```json
{
  "task_id": "tsk_8f2c1a",
  "task_type": "invoice_generation",
  "start_time": "2026-01-15T10:22:44Z",
  "end_time": "2026-01-15T10:22:59Z",
  "steps": 6,
  "input_tokens": 1840,
  "output_tokens": 512,
  "tool_calls": [
    { "tool": "get_order", "latency_ms": 42, "success": true },
    { "tool": "create_invoice", "latency_ms": 118, "success": true },
    { "tool": "notify_finance", "latency_ms": 31, "success": true }
  ],
  "outcome": "success",
  "error": null
}
```
Safety: build the guardrails before you need them
Input validation at the agent boundary. Validate and sanitize everything before it enters an agent loop. Prompt injection — where malicious content in a tool result hijacks the agent's next action — is a real threat for agents that browse the web or process user-submitted documents.
Confirmation checkpoints for irreversible actions. Any action that writes to production, sends an external message, or costs money should require an explicit confirmation step. This can be a human-in-the-loop review or an automated policy check.
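The checkpoint can be a small gate between the agent's decision and its execution. A sketch, assuming an explicit list of irreversible action names and a pluggable `approve` callback — both are illustrative:

```python
# Illustrative: the set of actions this deployment treats as irreversible.
IRREVERSIBLE = {"send_email", "charge_card", "deploy_production"}

def execute(action, args, approve):
    """Gate irreversible actions behind an explicit approval callback.

    `approve(action, args)` may be a human-in-the-loop prompt or an
    automated policy check; reversible actions pass straight through.
    """
    if action in IRREVERSIBLE and not approve(action, args):
        return {"status": "blocked", "reason": f"{action} requires confirmation"}
    # Dispatch to the real tool here; stubbed for illustration.
    return {"status": "executed", "action": action}
```

Keeping the gate outside the agent loop means the model cannot talk its way past it: approval is enforced by code, not by prompt.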
Hard resource limits, not soft warnings. Max tokens: hard stop. Max steps: hard stop. Max API spend per task: hard stop. Soft limits that emit warnings are ignored in production under load.
Audit logs are not optional. Every state transition, every tool call, every decision point should be logged to an immutable store. When regulators or customers ask "what did the agent do, and why?", you need a complete answer.
The operational checklist
Before moving an agentic system from staging to production:
- [ ] Every tool is idempotent and tested with duplicate calls
- [ ] Hard limits on tokens, steps, and wall-clock time are enforced
- [ ] Distributed tracing is in place for all tool calls
- [ ] Full LLM call logs are stored for at least 30 days
- [ ] Prompt injection surfaces have been tested with adversarial inputs
- [ ] There is a kill switch that stops all running tasks within 60 seconds
- [ ] Escalation paths to humans are defined and tested
- [ ] Task-level cost accounting is in place
Agentic AI is a genuinely powerful paradigm. But it is a distributed system with a non-deterministic component at its core — and it deserves the same engineering discipline as any other system you'd run in production.