Building Production Multi-Agent Systems on AWS Bedrock

A single LLM call solves a single problem. The interesting work begins when you need an agent that can plan, call tools, recover from errors, and hand off to specialists. That's multi-agent territory, and AWS Bedrock with AgentCore Runtime is where I've spent most of the last year.

This post is a field report from production deployments. Not architecture diagrams. The patterns that worked, the ones that didn't, and what I'd build differently next time.

The orchestrator pattern

The simplest multi-agent design has one orchestrator and several specialists. The orchestrator never answers a user question itself. It reads the query, decides which specialist to invoke, passes the right context, and assembles the response.

Why this beats one big agent: specialists can have tight prompts, narrow tool access, and bespoke evals. The orchestrator stays small and fast. When you need a new capability, you add a specialist. The orchestrator gains one more route, not a 200-token addition to its system prompt.

At BPX our specialists include a well-data agent, a routing agent for field technicians, and a Text-to-SQL agent over Snowflake. Each one knows nothing about the others. The orchestrator knows them all.
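To make the routing idea concrete, here is a minimal sketch rather than the production code. The specialist names mirror the ones above, but the `Specialist` and `Orchestrator` classes, the keyword lists, and the stub handlers are all hypothetical. Keyword matching stands in for whatever routing decision a real orchestrator makes, which would typically be an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Specialist:
    name: str
    keywords: Tuple[str, ...]      # stand-in for an LLM routing decision
    handle: Callable[[str], str]   # the specialist's own agent call


class Orchestrator:
    """Routes a query to exactly one specialist and never answers itself."""

    def __init__(self) -> None:
        self._specialists: Dict[str, Specialist] = {}

    def register(self, specialist: Specialist) -> None:
        # Adding a capability means adding one route; nothing else changes.
        self._specialists[specialist.name] = specialist

    def route(self, query: str) -> str:
        text = query.lower()
        scores = {
            name: sum(keyword in text for keyword in s.keywords)
            for name, s in self._specialists.items()
        }
        best = max(scores, key=scores.get, default=None)
        if best is None or scores[best] == 0:
            raise LookupError(f"no specialist matches: {query!r}")
        return best

    def answer(self, query: str) -> str:
        return self._specialists[self.route(query)].handle(query)


def build_orchestrator() -> Orchestrator:
    orchestrator = Orchestrator()
    # Stub handlers; real ones would invoke the specialist agents.
    orchestrator.register(
        Specialist("well_data", ("well", "production"), lambda q: f"[well_data] {q}")
    )
    orchestrator.register(
        Specialist("field_routing", ("technician", "dispatch", "route"), lambda q: f"[field_routing] {q}")
    )
    orchestrator.register(
        Specialist("text_to_sql", ("sql", "table", "snowflake"), lambda q: f"[text_to_sql] {q}")
    )
    return orchestrator
```

Note that the specialists never reference each other, and the orchestrator holds no domain logic beyond the routing table. Adding a fourth specialist is one more `register` call.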

Session isolation

AgentCore Runtime gives you per-session isolation out of the box. Every conversation runs in its own sandbox with its own memory and identity context. This sounds boring until you have seen the alternative: two concurrent users hit the same agent with conflicting goals, and one user's data leaks into the other's response.

The pattern: session ID flows from API Gateway through Lambda into AgentCore as a header. AgentCore handles isolation. You don't.
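A rough sketch of the Lambda side of that flow, under stated assumptions: the header name `x-session-id`, the event shape (an API Gateway proxy event), and the `invoke_agent` callable are all hypothetical. In a real deployment `invoke_agent` would wrap the AgentCore invoke call from the AWS SDK and pass the session ID through; check the current SDK documentation for the exact signature.

```python
import json
import uuid
from typing import Any, Callable, Dict, Optional

SESSION_HEADER = "x-session-id"  # hypothetical header name


def resolve_session_id(headers: Optional[Dict[str, str]]) -> str:
    """Return the caller's session ID, or mint one for a new conversation."""
    normalized = {k.lower(): v for k, v in (headers or {}).items()}
    session_id = normalized.get(SESSION_HEADER)
    if not session_id:
        session_id = f"session-{uuid.uuid4()}"
    return session_id


def make_handler(invoke_agent: Callable[[str, str], str]):
    """Build a Lambda handler around an injected agent call.

    invoke_agent(session_id, prompt) is an assumption of this sketch:
    it stands in for the SDK call that forwards the session ID to the runtime.
    """

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        session_id = resolve_session_id(event.get("headers"))
        body = json.loads(event.get("body") or "{}")
        answer = invoke_agent(session_id, body.get("prompt", ""))
        return {
            "statusCode": 200,
            # Echo the ID so the client can reuse it on the next turn.
            "headers": {SESSION_HEADER: session_id},
            "body": json.dumps({"answer": answer}),
        }

    return handler
```

The handler keeps no per-user state of its own. Its only job is to carry the session ID faithfully, so that isolation stays the runtime's responsibility.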

Memory that matters

There are three memory types worth distinguishing: scratchpad memory (within one turn), episodic memory (across turns in a session), and long-term memory (across sessions).

AgentCore Memory handles the first two natively. Long-term memory is your job, usually via a vector store keyed by user ID. The mistake I see most: agents that try to remember everything. Be aggressive about forgetting. Memory you don't trust is worse than memory you don't have.
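One way to read "be aggressive about forgetting" is as two rules: refuse to store anything you are not confident in, and expire what you do store. The sketch below illustrates only that policy. `LongTermMemory`, the confidence score, and the TTL are hypothetical, and a plain in-memory dict keyed by user ID stands in for the vector store.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MemoryItem:
    text: str
    confidence: float
    stored_at: float  # seconds, supplied by the caller


class LongTermMemory:
    """Per-user memory that drops low-confidence and stale items."""

    def __init__(self, ttl_seconds: float, min_confidence: float) -> None:
        self._ttl = ttl_seconds
        self._min_confidence = min_confidence
        self._items: Dict[str, List[MemoryItem]] = {}

    def remember(self, user_id: str, text: str, confidence: float, now: float) -> bool:
        # Rule 1: memory you don't trust never gets written.
        if confidence < self._min_confidence:
            return False
        self._items.setdefault(user_id, []).append(MemoryItem(text, confidence, now))
        return True

    def recall(self, user_id: str, now: float) -> List[str]:
        # Rule 2: expired items are deleted on read, not just hidden.
        fresh = [
            item for item in self._items.get(user_id, [])
            if now - item.stored_at <= self._ttl
        ]
        if fresh:
            self._items[user_id] = fresh
        else:
            self._items.pop(user_id, None)
        return [item.text for item in fresh]
```

Passing `now` explicitly keeps the policy easy to test. In production you would more likely read the clock inside the store.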

Identity and observability

OAuth-based identity through AgentCore Identity lets agents act on behalf of users without storing credentials. For enterprise deployments this matters a lot: the agent can call Snowflake or an internal API as the user, with the user's permissions, and the audit logs show who actually made the request.

For observability we use OpenTelemetry traces piped to CloudWatch. The dashboards I check daily: token usage per agent per session, latency P50/P95/P99, and goal success rate (did the agent finish what it was asked to do?). The third one is the only metric that actually measures whether the system is working.
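For reference, the three dashboard numbers can be computed from raw request records along these lines. This is a sketch under assumptions: the `RequestRecord` shape is hypothetical, each record is treated as one completed request, and the percentile uses the simple nearest-rank method, which may differ slightly from what CloudWatch reports.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


@dataclass(frozen=True)
class RequestRecord:
    agent: str
    session_id: str
    tokens: int
    latency_ms: int
    goal_met: bool  # did the agent finish what it was asked to do?


def percentile(values: List[int], p: int) -> int:
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def summarize(records: Iterable[RequestRecord]) -> Dict[str, Any]:
    records = list(records)
    if not records:
        raise ValueError("no records")

    # Token usage per agent per session.
    tokens: Dict[tuple, int] = defaultdict(int)
    for record in records:
        tokens[(record.agent, record.session_id)] += record.tokens

    latencies = [record.latency_ms for record in records]
    return {
        "tokens": dict(tokens),
        "latency_ms": {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)},
        "goal_success_rate": sum(r.goal_met for r in records) / len(records),
    }
```

The hard part is not the arithmetic but deciding what sets `goal_met`. That usually needs an explicit success signal from the specialist or an eval, not just the absence of an exception.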

What I'd do differently

Start with the orchestrator and one specialist. Resist the urge to design the full agent graph upfront. You will be wrong about which specialists you need. Get one path working end to end, then add.

Invest in evals before you invest in features. An agent that's 90% reliable today will be, at best, 90% reliable in six months unless you can measure whether each change helped or hurt.
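The smallest useful version of this is a fixed set of cases, a pass rate, and a check against the last known baseline. The sketch below assumes a substring match is a good enough pass criterion, which is rarely true for long; `EvalCase`, `pass_rate`, `has_regressed`, and the tolerance value are hypothetical names and numbers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    must_contain: str  # simplistic pass criterion, assumed for this sketch


def pass_rate(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases whose response contains the expected text."""
    if not cases:
        raise ValueError("no eval cases")
    passed = sum(
        case.must_contain.lower() in agent(case.prompt).lower()
        for case in cases
    )
    return passed / len(cases)


def has_regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True when the current pass rate falls below baseline minus tolerance."""
    return current < baseline - tolerance
```

Run in CI against every prompt or tool change, a check like this turns "it feels worse" into a failed build.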

And keep the orchestrator stupid. Every time someone proposes adding logic to the orchestrator, push it down into a specialist. The orchestrator's job is routing, not reasoning.