By Kashif Ullah · Published January 12, 2026 · 10 min read

#langgraph
#langchain
#ai-agents
#python

How I Build Production AI Agents with LangGraph

A practical breakdown of the architecture I use to ship reliable LangChain/LangGraph agents — typed state, deterministic nodes, observability, and human-in-the-loop.

Most “AI agent” demos work on stage and break on Monday. The gap is almost never the model — it’s the wrapper around the model. After shipping over a dozen production agents for clients ranging from legal-tech startups to logistics companies, I’ve settled on a structure that survives real traffic, real users, and real edge cases. Here’s the full architecture I use when I take an agent from sketch to production.

Why Most Agent Demos Fail in Production

The typical agent tutorial follows a pattern: import LangChain, write a system prompt, add a few tools, call agent.invoke(), and celebrate when the demo works. The problem is that this approach treats the entire agent as a single opaque function. There is no way to test individual steps, no way to observe what happened inside, and no way to recover gracefully when one step fails.

Production systems need the opposite of opacity. They need typed boundaries between steps, deterministic behavior wherever possible, structured logging at every node, and a clear strategy for what happens when the LLM returns something unexpected. LangGraph provides the skeleton for this by forcing you to model your agent as a directed graph with explicit state transitions.

Agents Are Systems, Not Prompts

A single 4,000-token prompt is a great prototype and a terrible product. It hides state, it can’t be tested, and it fails in ways that are impossible to debug. LangGraph helps because it forces you to draw the agent as an actual graph: nodes that take typed state in and put typed state out.

Here’s the actual shape of a graph I shipped for a document-processing agent — drawn the way I sketch it before writing a single line of code:

                    ┌─────────────┐
        ┌──────────▶│   classify  │  (LLM call)
        │           └──────┬──────┘
        │                  │ route
        │        ┌─────────┼─────────┐
        │        ▼         ▼         ▼
   ┌────────┐ ┌──────┐ ┌──────┐ ┌────────┐
   │ retry  │ │ bill │ │ ship │ │ general│
   └────────┘ └──┬───┘ └──┬───┘ └───┬────┘
        ▲        └────┬────┴─────────┘
        │             ▼
        │      ┌───────────────┐
        └──────│ validate (det)│  ← deterministic, no LLM
   on schema   └──────┬────────┘
     failure          ▼
              ┌──────────────────┐
              │ interrupt() gate │  ← human approves irreversible actions
              └──────┬───────────┘
                     ▼
                 ┌───────┐
                 │  END  │
                 └───────┘

Notice that only classify and the three branch handlers (bill, ship, general) touch the LLM. Everything else, including routing, validation, and the human gate, is ordinary, testable code.

Once you draw it, three things become obvious:

Most “agents” are actually 3–7 deterministic steps with one or two LLM calls.
The LLM calls are the slowest and least reliable parts.
You want everything else — routing, validation, tool calls, retries — to be ordinary code.

This realization changes how you build. Instead of writing a prompt that “does everything,” you decompose the task into discrete nodes. Some nodes call an LLM. Most don’t. The graph structure makes dependencies explicit and testable.
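
As a minimal sketch of what "routing is ordinary code" can look like, the function below maps a classifier label to the next node name. The node names mirror the diagram; the `label` field, the `ROUTES` table, and the choice to send unknown labels to the general handler are assumptions for illustration, not the shipped code. In LangGraph, a function of this shape is typically what you register with `add_conditional_edges`.

```python
# Hypothetical router: a pure function with no LLM call, so it is trivially unit-testable.
ROUTES = {"billing": "bill", "shipping": "ship"}


def route(state: dict) -> str:
    """Return the name of the next node for an already-classified state."""
    label = str(state.get("label", "")).strip().lower()
    # Anything the classifier did not map cleanly falls through to the general handler.
    return ROUTES.get(label, "general")


next_node = route({"label": "Billing"})  # "bill"
```

Because the router never touches the model, a routing regression shows up in a plain unit test rather than in a production trace.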

My Default Node Shape

Every node in my graphs accepts a Pydantic state model and returns a partial update. The state model defines every field the agent can read or write, with types enforced at runtime. LLM nodes use with_structured_output(schema) so the LLM is forced to emit valid JSON. If the model can’t conform to the schema, I retry with a stricter system prompt before failing.

This makes nodes individually testable — and that’s the unlock. A node that takes typed input and returns typed output is just a function. You can unit-test it with hardcoded inputs. You can mock the LLM response and verify the downstream logic. You can swap the model from GPT-4o to Claude Sonnet without rewriting the graph.

Here’s what a typical state model looks like in practice:

from pydantic import BaseModel, Field

class AgentState(BaseModel):
    query: str
    context: list[str] = Field(default_factory=list)
    tool_results: dict = Field(default_factory=dict)
    response: str | None = None
    confidence: float = 0.0
    needs_review: bool = False

Every node reads from this state and returns a partial dictionary that updates only the fields it’s responsible for. No global variables, no hidden side effects, no shared mutable state.
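
To make the "a node is just a function" point concrete, here is a rough sketch of a node tested with a canned LLM response. It uses a frozen dataclass as a dependency-free stand-in for the Pydantic model above; `answer_node`, `fake_llm`, and the injected `llm` parameter are hypothetical names, and the 0.5 review threshold is borrowed from the eval trace later in this post.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class State:
    """Dataclass stand-in for AgentState (same field names, no Pydantic needed)."""
    query: str
    context: list = field(default_factory=list)
    response: Optional[str] = None
    confidence: float = 0.0
    needs_review: bool = False


def answer_node(state: State, llm: Callable[[str], dict]) -> dict:
    """Call the injected LLM and return only the fields this node owns."""
    result = llm(state.query)
    confidence = float(result["confidence"])
    return {
        "response": result["text"],
        "confidence": confidence,
        "needs_review": confidence < 0.5,  # assumed review threshold
    }


def fake_llm(prompt: str) -> dict:
    """Canned response used in place of a real model call."""
    return {"text": f"echo: {prompt}", "confidence": 0.41}


before = State(query="Where is my invoice?")
update = answer_node(before, fake_llm)
after = replace(before, **update)  # merge the partial update into a new state
```

The original state is never mutated; the node only reports what changed, which is what keeps each step independently testable.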

When I run the eval harness against this graph, the output looks like this — and this is the kind of signal I actually watch on every PR:

$ python -m evals.run agents/doc_processor --cases tests/cases.json
Running 24 cases through doc_processor graph...
✓ classify            24/24   avg 0.81s
✓ validate            24/24   avg 0.01s
✓ tool_args_schema    23/24   avg 0.00s   ← 1 retry triggered
✗ end_to_end          22/24   (91.7%)

FAILED cases:
  case_017  expected route=billing, got route=general
  case_022  confidence 0.41 < threshold 0.50 (flagged for review ✓)

Accuracy 91.7% — threshold 90.0% — PASS
Cost: $0.038  Tokens: 41,204 in / 3,118 out

Note for readers: the trace above is from a representative run.

Tool Calls: Never Let the LLM Execute Directly

I never let the LLM execute tools directly. The LLM picks a tool name and proposes arguments; a deterministic node validates the arguments against a Pydantic schema and runs the tool. That validation step is where I catch the long-tail bugs — wrong types, missing fields, values outside acceptable ranges, injection attempts in string arguments.

The pattern looks like this: the LLM node outputs a ToolCall object with a name and arguments. The next node in the graph validates the arguments against a per-tool schema. If validation passes, the tool runs. If not, the error is fed back to the LLM with a request to fix its output. After two failed attempts, the node raises a structured error that triggers a fallback path in the graph.
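
The validate, retry, and fallback loop described above can be sketched roughly as follows. To stay runnable without dependencies it uses a hand-written check where the real graph would use a per-tool Pydantic schema; the `refund` tool, its limits, `ToolArgsError`, and `run_tool_call` are all hypothetical names.

```python
class ToolArgsError(Exception):
    """Structured error raised once the retry budget is spent."""

    def __init__(self, tool: str, errors: list):
        super().__init__(f"{tool}: {errors}")
        self.tool = tool
        self.errors = errors


def validate_refund(args: dict) -> list:
    """Hypothetical per-tool schema check; returns a list of problems."""
    errors = []
    if type(args.get("order_id")) is not int:
        errors.append("order_id must be an integer")
    amount = args.get("amount")
    is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
    if not is_number or not 0 < amount <= 500:
        errors.append("amount must be a number in (0, 500]")
    return errors


VALIDATORS = {"refund": validate_refund}


def run_tool_call(propose, tools, max_attempts=2):
    """Ask the LLM for a tool call, validate it, and run it only if it is valid."""
    feedback = None
    for _ in range(max_attempts):
        call = propose(feedback)  # expected shape: {"name": ..., "arguments": {...}}
        name, args = call.get("name"), call.get("arguments", {})
        if name not in VALIDATORS or name not in tools:
            errors = [f"unknown tool: {name}"]
        else:
            errors = VALIDATORS[name](args)
        if not errors:
            return tools[name](**args)
        feedback = "; ".join(errors)  # fed back to the LLM on the next attempt
    raise ToolArgsError(str(name), errors)  # caller routes this to the fallback path


seen_feedback = []


def fake_llm(feedback):
    """First proposal has a string order_id; the second one is corrected."""
    seen_feedback.append(feedback)
    order_id = "42" if feedback is None else 42
    return {"name": "refund", "arguments": {"order_id": order_id, "amount": 20}}


def refund(order_id, amount):
    return f"refunded {amount} on order {order_id}"


result = run_tool_call(fake_llm, {"refund": refund})
```

The tool function only ever sees arguments that passed validation, so type and range bugs surface as structured errors instead of as side effects.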

This sounds like extra work, and it is — for about an hour. After that hour, you never again debug a production incident caused by the LLM passing a string where an integer was expected, or calling a tool with a malformed URL, or submitting a database query with unescaped user input embedded in it.

Human-in-the-Loop: The Feature That Saves You

For anything irreversible — sending an email, deleting data, charging a card, submitting a form to a government API — I add an interrupt() checkpoint. The graph pauses, surfaces the proposed action to a human via webhook or UI notification, and resumes only on explicit approval.

This costs you a tiny amount of latency and saves you a lot of incidents. In one project, a legal document agent was drafting and sending client communications. Without the interrupt checkpoint, a hallucinated clause in one email could have created legal liability. With it, a paralegal reviews the draft in under a minute and clicks approve. The agent resumes and sends the email. Total added latency: 45 seconds average. Total incidents from hallucinated content: zero.

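The implementation in LangGraph is straightforward. You add an interrupt("review_action") call inside the node that precedes the irreversible action. The graph serializes its state to a checkpoint store (I use Redis or PostgreSQL), and your frontend polls for pending interrupts. When the human approves, you call graph.invoke(Command(resume=decision), config) with the same thread ID. LangGraph restores the checkpointed state and re-runs the interrupted node from its start, with interrupt() now returning the human's decision, so keep any side effects after the interrupt() call. (Passing None as the input is how you resume static interrupt_before/interrupt_after breakpoints, not interrupt() calls.)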

A Trade-off I Made: PostgreSQL Over Redis for Checkpoints

When I wired up checkpoint persistence for the human-in-the-loop graph, I chose PostgreSQL over Redis — even though Redis is the “obvious” fast choice and the LangGraph docs reach for it first.

Here’s my reasoning. Redis gives you lower write latency, and for a high-throughput chat agent that genuinely matters. But this client’s agent paused for human review, and those pauses could last hours — a paralegal might approve a draft after lunch. With Redis I’d have to reason about eviction policies and TTLs to make sure a half-finished agent state wasn’t silently dropped. With PostgreSQL, the checkpoint is just a durable row; it survives a restart, it’s queryable when I’m debugging (“show me every agent stuck waiting for review older than 1 hour”), and it shares the database the rest of the app already runs.

I chose durability and queryability over raw write speed because the workload was human-paced, not machine-paced. If I were building a 1,000-req/sec support bot with sub-second turns, I’d flip that decision and take Redis. The point isn’t that one is better — it’s that the right answer falls out of the workload, and you should be able to say why you picked yours.
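
To illustrate the queryability argument, here is a rough sketch of the "stuck waiting for review older than 1 hour" query. It runs against in-memory SQLite so it needs no server, and the `pending_reviews` table is a hypothetical app-side table, not the schema LangGraph's own checkpointer creates.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the application's PostgreSQL database
db.execute(
    "CREATE TABLE pending_reviews ("
    " thread_id TEXT PRIMARY KEY,"
    " status TEXT NOT NULL,"
    " paused_at INTEGER NOT NULL)"  # epoch seconds
)

now = 1_700_000_000  # fixed "current time" so the example is reproducible
db.executemany(
    "INSERT INTO pending_reviews VALUES (?, ?, ?)",
    [
        ("thread-a", "waiting", now - 3 * 3600),   # paused three hours ago
        ("thread-b", "waiting", now - 10 * 60),    # paused ten minutes ago
        ("thread-c", "approved", now - 5 * 3600),  # already handled
    ],
)

# Every run still waiting for a human after more than one hour, oldest first.
stuck = db.execute(
    "SELECT thread_id, (? - paused_at) / 60 AS minutes_waiting"
    " FROM pending_reviews"
    " WHERE status = 'waiting' AND paused_at < ?"
    " ORDER BY paused_at",
    (now, now - 3600),
).fetchall()
```

With a key-value store, answering the same question usually means scanning keys and deserializing state in application code.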

Observability: Wire It From Day One

LangSmith or plain OpenTelemetry — pick one and wire it from day one. Every node should emit a span with its name, input state hash, output state hash, and execution time. Every LLM call should log its prompt template, rendered prompt, raw response, parsed output, token counts, and cost.
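
As a dependency-free sketch of the per-node span described above, the decorator below records the node name, input and output state hashes, and execution time. The `SPANS` list stands in for a real exporter such as LangSmith or OpenTelemetry, and `traced` and `state_hash` are hypothetical helper names.

```python
import hashlib
import json
import time
from functools import wraps

SPANS = []  # stand-in for a real trace exporter


def state_hash(state: dict) -> str:
    """Short, stable fingerprint of a state dict."""
    payload = json.dumps(state, sort_keys=True, default=str).encode()
    return hashlib.sha256(payload).hexdigest()[:12]


def traced(node):
    """Wrap a node so every successful call emits one span."""

    @wraps(node)
    def wrapper(state: dict) -> dict:
        started = time.perf_counter()
        update = node(state)
        SPANS.append({
            "node": node.__name__,
            "input_hash": state_hash(state),
            "output_hash": state_hash({**state, **update}),
            "duration_ms": round((time.perf_counter() - started) * 1000, 3),
        })
        return update

    return wrapper


@traced
def validate(state: dict) -> dict:
    return {"needs_review": state["confidence"] < 0.5}


update = validate({"query": "refund status", "confidence": 0.41})
```

This sketch only records successful calls; a production version would also want to emit a span when the node raises.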

The moment your agent is in production, observability is the only thing standing between you and “it just doesn’t work and I don’t know why.” I’ve debugged agents at 2 AM where the only thing that saved me was being able to pull up the exact prompt that produced a bad response, compare it to the prompt from a successful run, and spot the difference in context retrieval.

Beyond debugging, observability data feeds your evaluation pipeline. After a few weeks in production, you have hundreds of real traces showing exactly where the agent succeeds and where it struggles. That data is worth more than any synthetic benchmark.

Evaluation: Build a Harness, Not a Vibe Check

“We look at it and it seems good” is not evaluation. For every production agent, I build a small eval harness — usually just a Python script that runs a set of test cases through the graph and compares outputs against expected results.

The test set doesn’t need to be large. Twenty to thirty real examples, hand-labeled, covering the happy path and the known edge cases. I store them in a simple JSON file alongside the agent code. The eval script runs on every PR and blocks merges if accuracy drops below a threshold.
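
A harness along those lines can be very small. The sketch below is an assumption about its shape, not the actual `evals.run` module: `evaluate`, the inline JSON cases, and the `keyword_router` stand-in for the real graph are all hypothetical.

```python
import json

CASES_JSON = """[
  {"id": "case_001", "input": "refund for invoice 8841", "expected": "billing"},
  {"id": "case_002", "input": "where is my parcel", "expected": "shipping"},
  {"id": "case_003", "input": "hello there", "expected": "general"},
  {"id": "case_004", "input": "charged twice for shipping", "expected": "billing"}
]"""


def evaluate(run_graph, cases, threshold):
    """Run every case through the graph and compare against the hand label."""
    failures = []
    for case in cases:
        got = run_graph(case["input"])
        if got != case["expected"]:
            failures.append({"id": case["id"], "expected": case["expected"], "got": got})
    accuracy = (len(cases) - len(failures)) / len(cases) if cases else 0.0
    return {"accuracy": accuracy, "passed": accuracy >= threshold, "failures": failures}


def keyword_router(text: str) -> str:
    """Deliberately naive stand-in for the real graph."""
    if "ship" in text or "parcel" in text:
        return "shipping"
    if "invoice" in text or "charged" in text:
        return "billing"
    return "general"


report = evaluate(keyword_router, json.loads(CASES_JSON), threshold=0.9)
# In CI, exit non-zero when report["passed"] is False so the merge is blocked.
```

Here the stand-in misroutes case_004, so accuracy lands at 75% and the run fails the 90% gate, which is exactly the signal you want a PR check to surface.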

For more nuanced outputs (summaries, drafted emails, recommendations), I use LLM-as-judge evaluation with a separate model and a rubric. The rubric is specific: “Does the response include the customer’s name? Does it reference the correct order number? Is the tone professional?” Generic “is this good?” prompts produce useless scores.
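
One way to keep the rubric specific is to score each question separately and aggregate. In this rough sketch the `judge` callable is a hypothetical hook where the separate judge model would be called; `fake_judge` replaces it with string checks so the example runs offline.

```python
RUBRIC = [
    "Does the response include the customer's name?",
    "Does it reference the correct order number?",
    "Is the tone professional?",
]


def score_with_rubric(response: str, context: dict, judge) -> dict:
    """Ask the judge one yes/no question per rubric item and average the results."""
    verdicts = {q: bool(judge(q, response, context)) for q in RUBRIC}
    return {"score": sum(verdicts.values()) / len(RUBRIC), "verdicts": verdicts}


def fake_judge(question: str, response: str, context: dict) -> bool:
    """Offline stand-in for an LLM judge."""
    if "name" in question:
        return context["customer"] in response
    if "order number" in question:
        return context["order"] in response
    return True  # tone is not checked by this stand-in


graded = score_with_rubric(
    "Hi Dana, your order has shipped.",
    {"customer": "Dana", "order": "A-1009"},
    fake_judge,
)
```

Per-question verdicts also tell you which rubric item regressed, which a single overall score cannot.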

What I’d Skip

Multi-agent orchestration for the sake of it. If a single graph can do the job, use a single graph. “Crew” and “swarm” patterns are great when you genuinely have parallel sub-tasks with independent state — a research agent gathering data while a writing agent drafts content. They’re operational pain for everything else: more state to serialize, more failure modes to handle, more traces to follow when debugging.

I’d also skip autonomous “let the agent figure it out” loops for production systems. Bounded autonomy — the agent can take up to N steps before it must either produce a result or ask for help — is almost always the right call. Unbounded loops are how you get runaway API costs and agents stuck in infinite retry cycles.
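
Bounded autonomy can be as simple as a step budget around the loop. The sketch below is plain Python with hypothetical names (`run_bounded`, `steps_taken`); inside LangGraph itself, the `recursion_limit` config value plays a similar guard-rail role.

```python
def run_bounded(step, state: dict, max_steps: int = 5) -> dict:
    """Run `step` until it reports done, or hand off to a human after max_steps."""
    for taken in range(1, max_steps + 1):
        state = {**state, **step(state), "steps_taken": taken}
        if state.get("done"):
            return state
    # Budget exhausted: stop spending tokens and ask for help instead.
    return {**state, "needs_review": True}


def never_done(state: dict) -> dict:
    """Stand-in for an agent step that never converges."""
    return {"attempts": state.get("attempts", 0) + 1}


def finishes_on_third(state: dict) -> dict:
    """Stand-in for an agent step that converges on its third try."""
    attempts = state.get("attempts", 0) + 1
    return {"attempts": attempts, "done": attempts >= 3}


stuck = run_bounded(never_done, {"query": "x"}, max_steps=3)
finished = run_bounded(finishes_on_third, {"query": "x"}, max_steps=5)
```

The important property is that the worst case is known in advance: at most N steps of cost, then a human sees it.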

For most production agent projects, here’s my default stack:

Orchestration: LangGraph with typed Pydantic state
LLM calls: LangChain’s ChatModel with with_structured_output()
Tool validation: Pydantic schemas per tool
State persistence: PostgreSQL or Redis for checkpoints
Observability: LangSmith (or OpenTelemetry + Jaeger for self-hosted)
Evaluation: Custom Python harness with JSON test cases
Deployment: FastAPI wrapper with health checks, containerized on AWS Lambda or ECS
CI: Eval harness runs on every PR, blocks merge on regression

Frequently Asked Questions

When should I use LangGraph instead of a simple LangChain chain?

Use LangGraph when your workflow has branching logic, cycles (retry loops), or needs state persistence across steps. If your agent is a linear pipeline — retrieve context, generate response, done — a simple chain is fine. The moment you add conditional routing (“if the user asks about billing, go to this node; if they ask about shipping, go to that node”), LangGraph’s graph model pays for itself in clarity and testability.

How do I handle LLM rate limits in production agents?

I use exponential backoff with jitter at the LangChain level, combined with a token-bucket rate limiter in front of the LLM node. For high-throughput systems, I also maintain a queue of pending LLM calls and process them in batches, which lets me stay under rate limits while maximizing throughput.
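
Both pieces fit in a few lines of standard-library Python. This is a sketch under assumptions rather than the LangChain-level configuration mentioned above: `TokenBucket`, `RateLimitError`, and `call_with_backoff` are hypothetical names, and the clock, sleep, and random sources are injectable so the behavior can be tested without waiting.

```python
import random
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def try_acquire(self, n: float = 1) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


class RateLimitError(Exception):
    """Stand-in for the provider's rate-limit (HTTP 429) exception."""


def call_with_backoff(fn, retries=5, base=0.5, cap=8.0, sleep=time.sleep, rng=random.random):
    """Retry `fn` on rate limits, sleeping a random share of an exponential delay."""
    for attempt in range(retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == retries - 1:
                raise
            # Full jitter: random delay between 0 and min(cap, base * 2^attempt).
            sleep(rng() * min(cap, base * 2 ** attempt))
```

The limiter goes in front of the LLM node to avoid hitting the limit in the first place; the backoff handles the cases where the provider pushes back anyway.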

What’s the cost of running a production agent?

It varies enormously by use case. A customer support agent handling 1,000 conversations per day with GPT-4o-mini typically costs $15–40/day in LLM API fees. The same volume with Claude Sonnet might be $25–60/day. The key cost lever is context window size — shorter, more focused prompts with good retrieval save more money than switching models.

Can I use open-source models instead of commercial APIs?

Yes, and LangGraph makes this easy because swapping the model is a one-line change in the LLM node. I’ve deployed agents on Llama 3 via vLLM for clients with data-residency requirements. The tradeoff is that you need GPU infrastructure and the structured output compliance is less reliable, so you’ll spend more time on validation and retries.

How do I test an agent that calls external APIs?

Mock the external API calls at the tool level, not the LLM level. Your tests should verify that the LLM picks the right tool with the right arguments, and separately verify that the tool integration works against a staging environment. This separation means your unit tests run in seconds and your integration tests catch real API issues.
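
Mocking at the tool level might look like the sketch below, which uses `unittest.mock` from the standard library. The `lookup_order` tool, its `/orders/{id}` path, and the `tool_node` wrapper are hypothetical; the point is that the HTTP call is replaced while the tool logic and the node around it run for real.

```python
from unittest.mock import Mock


def lookup_order(order_id: int, http_get) -> dict:
    """Tool under test: the HTTP client is injected so tests can replace it."""
    payload = http_get(f"/orders/{order_id}")
    return {"order_id": order_id, "status": payload["status"]}


def tool_node(tool_call: dict, tools: dict) -> dict:
    """Deterministic node: run the requested tool and return a partial state update."""
    name = tool_call["name"]
    return {"tool_results": {name: tools[name](**tool_call["arguments"])}}


# Replace only the external API; everything else is the real code path.
http_get = Mock(return_value={"status": "shipped", "carrier": "ACME"})
tools = {"lookup_order": lambda order_id: lookup_order(order_id, http_get)}

update = tool_node({"name": "lookup_order", "arguments": {"order_id": 42}}, tools)
http_get.assert_called_once_with("/orders/42")
```

The same `lookup_order` function can then be pointed at a staging client in the integration suite without changing the node.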

Building an agent and want a second pair of eyes? I take on a handful of projects each quarter — get in touch.