
A Practical Guide to Building AI Agents That Actually Ship

Most agent frameworks produce impressive demos that collapse in production. Here's how we built an Agent Operating System with a two-layer architecture, wave-based parallel execution, and evaluation gates, and how you can wire up agents that actually ship.

Gal Hindi

April 6, 2026 · 5 min read

Updated May 15, 2026

Every week a new agent framework drops on GitHub. Every week someone posts a video of an LLM "autonomously" building a to-do app. And every week, engineering teams try to take that energy into production and hit the same wall. The gap is enormous between a demo that streams tokens and a system that runs reliably, handles failures, coordinates multiple agents, and gives humans visibility into what's happening. We spent months closing that gap. This is a practical walkthrough of what we built and how it works.

The Problem with Single-Loop Agents

The standard recipe is straightforward: call streamText with a system prompt and some tools, then let the model loop until it's done. This works for short, linear tasks and breaks down beyond them. A single-loop agent can't decompose a complex goal into parallel workstreams. It can't evaluate its own output against a rubric before declaring success. It can't hand off a subtask to a specialized model and get the result back. And when it fails at step 47 of a 50-step plan, you have no structured way to understand what happened. We needed something with more structure.

Two Layers, One System

Our Agent Operating System (AOS) splits the world into two layers. The Substrate Layer handles direct LLM execution — Vercel AI SDK's streamText, tool calling, session management, and memory. If you just need an agent that answers questions and calls a few tools, the substrate is all you touch. It's fast, simple, and stateless between calls.

The AOS Layer sits above and handles everything that makes agents production-grade: planning, task graphs, artifact management, evaluation gates, and multi-agent coordination. When a run starts, the AOS planner decomposes a goal into a directed acyclic graph of task nodes. Independent nodes execute concurrently in waves. Dependent nodes wait for their upstream results. The substrate does the actual LLM work; the AOS layer decides what work to do and when.
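To make the division of labor concrete, here is a minimal sketch of wave execution. It assumes a hypothetical `Substrate` function type standing in for the streamText-based executor and a hypothetical `TaskNode` shape. Neither name comes from the AOS codebase, and the real layer does far more (planning, artifacts, gates).

```typescript
// Hypothetical shapes for illustration; not the actual AOS types.
type TaskNode = { id: string; prompt: string; dependsOn: string[] };

// Stand-in for the substrate layer: one LLM call per node.
// `upstream` holds the outputs of the node's dependencies.
type Substrate = (
  node: TaskNode,
  upstream: Record<string, string>,
) => Promise<string>;

// Orchestration-layer responsibility: decide what runs and when.
// Each wave is a list of nodes whose dependencies finished in earlier waves.
async function executeWaves(
  waves: TaskNode[][],
  substrate: Substrate,
): Promise<Map<string, string>> {
  const results = new Map<string, string>();
  for (const wave of waves) {
    // Nodes in the same wave are independent, so they can run concurrently.
    const outputs = await Promise.all(
      wave.map(async (node) => {
        const upstream: Record<string, string> = {};
        for (const dep of node.dependsOn) {
          const value = results.get(dep);
          if (value === undefined) {
            throw new Error(`Node ${node.id} scheduled before dependency ${dep}`);
          }
          upstream[dep] = value;
        }
        return [node.id, await substrate(node, upstream)] as const;
      }),
    );
    // Results become visible to later waves only after the whole wave settles.
    for (const [id, output] of outputs) results.set(id, output);
  }
  return results;
}
```

The substrate callback never sees the graph. It receives one node and that node's upstream outputs, which keeps the execution layer replaceable.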

Defining an Agent

Rather than asking teams to wire everything from scratch, we ship agent archetypes — pre-built factories that encode domain knowledge into the orchestration layer. We currently have four: a Monitoring Agent for infrastructure investigation (log queries, metric correlation, incident reports), an App Builder Agent that coordinates Designer, Architect, Builder, and QA sub-agents, a Marketing Agent with a 7-stage content pipeline and Instagram publishing, and a PR Review Agent for automated code review.

Each archetype defines its own task node types, artifact schemas, evaluation criteria, and tool permissions. You instantiate one, give it a goal, and the AOS handles the rest. But archetypes aren't black boxes — every component is overridable. Swap the planner, add custom evaluation gates, wire in additional MCP tools. The archetype gives you a running start; you own the finish line.
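The "defaults plus overrides" idea can be sketched as a small factory. Everything here is hypothetical: `defineArchetype`, `AgentConfig`, and the tool name `github.read_pull_request` are illustrative names, not the AOS API. Only the gate type names are taken from this post.

```typescript
// Hypothetical shapes for illustration; the real archetype API is not shown in this post.
type GateType =
  | "RUBRIC"
  | "ARTIFACT_DIFF"
  | "SCHEMA_VALIDATION"
  | "POLICY_CHECK"
  | "HUMAN";

interface AgentConfig {
  name: string;
  planner: (goal: string) => string[]; // returns an ordered list of task descriptions
  gates: GateType[];
  tools: string[];
}

// An archetype is a factory: sensible defaults plus caller-supplied overrides.
function defineArchetype(defaults: AgentConfig) {
  return (overrides: Partial<AgentConfig> = {}): AgentConfig => ({
    ...defaults,
    ...overrides,
  });
}

const prReviewAgent = defineArchetype({
  name: "pr-review",
  planner: (goal) => [`read diff for: ${goal}`, "write structured review"],
  gates: ["SCHEMA_VALIDATION", "POLICY_CHECK"],
  tools: ["github.read_pull_request"], // hypothetical tool name
});

// Override only what differs; everything else keeps the archetype default.
const strictReviewer = prReviewAgent({
  gates: ["SCHEMA_VALIDATION", "POLICY_CHECK", "RUBRIC"],
});
```

A shallow merge like this replaces whole fields (the entire `gates` array, for instance). A real implementation might prefer an explicit append or remove API for gates and tools.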

The Run Lifecycle

When you kick off an agent run, it moves through a well-defined lifecycle: QUEUED, PLANNING, RUNNING, WAITING, and finally COMPLETED, FAILED, or CANCELLED. The PLANNING phase is where the AOS planner analyzes the goal and produces a task DAG. Each node in the graph has a type — REASON for pure inference, TOOL for tool calls, SUBGOAL for recursive decomposition, HANDOFF for delegating to another agent, EVALUATION for quality gates, ARTIFACT_TRANSFORM for modifying outputs, and SANDBOX_JOB for isolated code execution.
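The statuses and node types above map naturally onto union types and a transition table. In the sketch below, the status and node type names come from this post, but the set of allowed transitions is an assumption about a reasonable policy, not a description of what AOS enforces.

```typescript
type RunStatus =
  | "QUEUED"
  | "PLANNING"
  | "RUNNING"
  | "WAITING"
  | "COMPLETED"
  | "FAILED"
  | "CANCELLED";

type NodeType =
  | "REASON"
  | "TOOL"
  | "SUBGOAL"
  | "HANDOFF"
  | "EVALUATION"
  | "ARTIFACT_TRANSFORM"
  | "SANDBOX_JOB";

// Hypothetical node shape: a typed node plus the ids it waits on.
interface PlannedNode {
  id: string;
  type: NodeType;
  dependsOn: string[];
}

// ASSUMPTION: this table is a guess at sensible transitions, e.g. that a
// WAITING run resumes into RUNNING once a human responds.
const allowedTransitions: Record<RunStatus, readonly RunStatus[]> = {
  QUEUED: ["PLANNING", "CANCELLED"],
  PLANNING: ["RUNNING", "FAILED", "CANCELLED"],
  RUNNING: ["WAITING", "COMPLETED", "FAILED", "CANCELLED"],
  WAITING: ["RUNNING", "FAILED", "CANCELLED"],
  COMPLETED: [],
  FAILED: [],
  CANCELLED: [],
};

// Returns the new status, or throws if the move is not in the table.
function transition(from: RunStatus, to: RunStatus): RunStatus {
  if (!allowedTransitions[from].includes(to)) {
    throw new Error(`Illegal run transition: ${from} -> ${to}`);
  }
  return to;
}

// Terminal statuses are the ones with no outgoing transitions.
function isTerminal(status: RunStatus): boolean {
  return allowedTransitions[status].length === 0;
}
```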

The planner groups independent nodes into waves. Wave 1 might run three research tasks in parallel. Wave 2 synthesizes those results. Wave 3 produces a draft artifact. Wave 4 evaluates it. This isn't speculative — the DAG is concrete, inspectable, and logged before execution begins.
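One way to derive waves from a dependency graph is a level-by-level topological grouping: a node joins the first wave in which all of its dependencies have already been placed. This is a generic sketch of that technique with a hypothetical `DagNode` shape and `planWaves` function, not the AOS planner itself.

```typescript
// Hypothetical node shape for illustration.
type DagNode = { id: string; dependsOn: string[] };

// Groups nodes into waves. Throws if the graph has a cycle or a
// dependency on an unknown node.
function planWaves(nodes: DagNode[]): string[][] {
  const known = new Set(nodes.map((n) => n.id));
  for (const node of nodes) {
    for (const dep of node.dependsOn) {
      if (!known.has(dep)) {
        throw new Error(`Unknown dependency ${dep} on ${node.id}`);
      }
    }
  }

  const placed = new Set<string>();
  const waves: string[][] = [];
  let remaining = nodes;
  while (remaining.length > 0) {
    // Ready nodes depend only on nodes placed in earlier waves.
    const ready = remaining.filter((n) =>
      n.dependsOn.every((d) => placed.has(d)),
    );
    if (ready.length === 0) throw new Error("Cycle detected in task graph");
    waves.push(ready.map((n) => n.id));
    for (const n of ready) placed.add(n.id);
    remaining = remaining.filter((n) => !placed.has(n.id));
  }
  return waves;
}

// Mirrors the example in the text: research, synthesize, draft, evaluate.
const waves = planWaves([
  { id: "research-a", dependsOn: [] },
  { id: "research-b", dependsOn: [] },
  { id: "research-c", dependsOn: [] },
  { id: "synthesize", dependsOn: ["research-a", "research-b", "research-c"] },
  { id: "draft", dependsOn: ["synthesize"] },
  { id: "evaluate", dependsOn: ["draft"] },
]);
// waves is [["research-a", "research-b", "research-c"], ["synthesize"], ["draft"], ["evaluate"]]
```

Because the result is plain data, it can be logged or rendered before any node runs.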

Wiring External Tools with MCP

Agents are only as useful as the tools they can reach. We use the Model Context Protocol (MCP) to connect agents to external services — Google Workspace, Instagram, GitHub, and anything else with an API. Each integration runs as an MCP server that exposes typed tools the agent can call.

The critical piece is per-agent OAuth. Our integration-service manages scoped tokens so that each agent only has access to the integrations explicitly granted to it. A Marketing Agent can post to Instagram but can't read your Google Drive unless you wire that integration in. This isn't role-based — it's agent-scoped, which matters when you're running dozens of agents across an organization.
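The agent-scoped check can be illustrated with a small in-memory registry. This is a sketch under heavy assumptions: `GrantRegistry` and the integration identifiers are hypothetical, and the real integration-service with its OAuth token handling is not shown in this post.

```typescript
// Hypothetical in-memory grant registry for illustration only.
type AgentId = string;
type Integration = "instagram" | "google_drive" | "github";

class GrantRegistry {
  private grants = new Map<AgentId, Set<Integration>>();

  grant(agent: AgentId, integration: Integration): void {
    const set = this.grants.get(agent) ?? new Set<Integration>();
    set.add(integration);
    this.grants.set(agent, set);
  }

  // Access is decided per agent instance, not per role or per user.
  canUse(agent: AgentId, integration: Integration): boolean {
    return this.grants.get(agent)?.has(integration) ?? false;
  }

  // Guard to call before dispatching any tool call to an integration.
  assertCanUse(agent: AgentId, integration: Integration): void {
    if (!this.canUse(agent, integration)) {
      throw new Error(`Agent ${agent} has no grant for ${integration}`);
    }
  }
}
```

The default is deny: an agent with no entry in the registry can reach nothing until a grant is added explicitly.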

Evaluation Gates: The Production Differentiator

Here's what separates a demo from a system you'd trust with real work: evaluation gates. Before an agent run can transition from producing artifacts to declaring completion, it passes through one or more evaluation nodes. We support five gate types:

RUBRIC: an LLM-as-judge evaluates output against defined criteria and scores it.
ARTIFACT_DIFF: compares the produced artifact against a reference or previous version.
SCHEMA_VALIDATION: validates structured output against a JSON schema or protobuf definition.
POLICY_CHECK: enforces organizational rules (no secrets in output, must include error handling, etc.).
HUMAN: pauses the run and moves it to WAITING until a human approves or rejects.

Gates are composable. A PR Review Agent might run SCHEMA_VALIDATION on its structured review, then POLICY_CHECK to ensure it follows your team's review guidelines, then RUBRIC to score the review quality. If any gate fails, the run loops back to a repair node rather than emitting garbage.
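Gate composition with a repair loop might look roughly like the following. The `Gate` shape, `runGates`, `produceWithGates`, and the `maxRepairs` bound are all assumptions made for this sketch. The post does not say how AOS limits repair attempts, so the bound is included only to keep the loop from running forever.

```typescript
// Hypothetical shapes for illustration; not the actual AOS gate API.
type GateResult = { passed: boolean; reason?: string };
type Gate<A> = { name: string; check: (artifact: A) => Promise<GateResult> };

// Runs gates in order and stops at the first failure.
async function runGates<A>(
  artifact: A,
  gates: Gate<A>[],
): Promise<GateResult & { gate?: string }> {
  for (const gate of gates) {
    const result = await gate.check(artifact);
    if (!result.passed) return { ...result, gate: gate.name };
  }
  return { passed: true };
}

// On failure, hands the artifact and the failure reason to a repair step
// and re-runs every gate, up to maxRepairs times (an assumed bound).
async function produceWithGates<A>(
  initial: A,
  gates: Gate<A>[],
  repair: (artifact: A, reason: string) => Promise<A>,
  maxRepairs = 2,
): Promise<A> {
  let artifact = initial;
  for (let attempt = 0; ; attempt++) {
    const verdict = await runGates(artifact, gates);
    if (verdict.passed) return artifact;
    if (attempt >= maxRepairs) {
      throw new Error(
        `Gate ${verdict.gate} still failing after ${maxRepairs} repairs`,
      );
    }
    artifact = await repair(artifact, verdict.reason ?? "no reason given");
  }
}
```

Re-running all gates after a repair, rather than only the one that failed, guards against a repair that fixes one problem while introducing another.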

Observing Runs in Real Time

Every run emits streaming events: status_changed, plan_created, node_text_delta, node_tool_call, artifact_produced, run_completed. Our Agent Runs UI consumes these events and renders a live view of execution — you can see which nodes are active, watch tool calls happen, inspect intermediate artifacts, and trace failures back to the specific node that produced them.

This isn't a nice-to-have. When an agent run fails at step 23, you need to know whether it was a bad tool response, a flawed plan decomposition, or a model hallucination. Structured runs with typed events give you that. A single while (true) loop with console.log does not.
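Typed events make that kind of tracing mechanical. In the sketch below, the event names are the ones listed above, but every payload field (`nodeId`, `delta`, `ok`, and so on) is an assumption, as are the `RunView` shape and the `applyEvent` reducer.

```typescript
// Event names come from the post; every payload field below is an assumption.
type RunEvent =
  | { type: "status_changed"; status: string }
  | { type: "plan_created"; nodeIds: string[] }
  | { type: "node_text_delta"; nodeId: string; delta: string }
  | { type: "node_tool_call"; nodeId: string; tool: string; ok: boolean }
  | { type: "artifact_produced"; nodeId: string; artifactId: string }
  | { type: "run_completed"; success: boolean };

interface RunView {
  status: string;
  text: Record<string, string>; // accumulated output per node
  failedToolCalls: { nodeId: string; tool: string }[];
  artifacts: string[];
}

const emptyView = (): RunView => ({
  status: "QUEUED",
  text: {},
  failedToolCalls: [],
  artifacts: [],
});

// Folds one event into the view. The switch is exhaustive over RunEvent,
// so adding a new event type is a compile error until it is handled here.
function applyEvent(view: RunView, event: RunEvent): RunView {
  switch (event.type) {
    case "status_changed":
      return { ...view, status: event.status };
    case "plan_created": {
      const text = { ...view.text };
      for (const id of event.nodeIds) text[id] = text[id] ?? "";
      return { ...view, text };
    }
    case "node_text_delta": {
      const previous = view.text[event.nodeId] ?? "";
      return {
        ...view,
        text: { ...view.text, [event.nodeId]: previous + event.delta },
      };
    }
    case "node_tool_call":
      if (event.ok) return view;
      return {
        ...view,
        failedToolCalls: [
          ...view.failedToolCalls,
          { nodeId: event.nodeId, tool: event.tool },
        ],
      };
    case "artifact_produced":
      return { ...view, artifacts: [...view.artifacts, event.artifactId] };
    case "run_completed":
      return { ...view, status: event.success ? "COMPLETED" : "FAILED" };
  }
}
```

Folding a recorded event stream through `applyEvent` reconstructs the run state at any point, which is what lets a failure be traced to the node and tool call that caused it.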

The Iteration Loop

Building agents that ship is not a one-shot exercise. The pattern we've settled into is: define the archetype and its tool surface, wire the integrations and evaluation gates, run against real inputs, observe the task graph and streaming events, and iterate on the planner prompts, gate thresholds, and tool definitions. The system is designed so that each of these steps is independently tunable. You don't have to re-architect when a prompt change doesn't land — you adjust the specific node and re-run.

The uncomfortable truth about AI agents in 2026 is that the hard part was never the LLM call. It's everything around it — decomposition, parallelism, evaluation, observability, and access control. If you're building agents that need to work beyond a demo, you need structure. We built AOS to be that structure.