If you've ever tried to build a production-grade LLM agent, you've likely hit the wall. At first, it feels like magic. You write a prompt, the agent follows instructions, and everything works. But as the complexity of the task grows, the magic fades. The agent starts drifting from objectives, hallucinating its current state, and completely losing the plot during long-running interactions.
The industry's knee-jerk reaction is to throw more context or more complex prompt engineering at the problem. But the real bottleneck isn't context size—it's reliability. An LLM given unlimited context still makes inconsistent decisions because it fundamentally lacks structural constraints. Prompt engineering is essentially trying to build a state machine out of natural language. It works 95% of the time, until it spectacularly doesn't, and you can never predict exactly which interaction will trigger the failure.
Enter Finite State Machines (FSMs). By combining the deterministic reliability of FSMs with the creative reasoning of Large Language Models, we get the FSM-LLM hybrid architecture. The premise is simple: the FSM handles the flow, and the LLM handles the wording. You get agents that are both rock-solid and highly natural. The FSM guarantees the conversation follows a valid, predictable path, while the LLM ensures the responses feel human rather than rigidly scripted.
I've used this exact architecture in production for conversational agents, workflow automation, and multi-step task execution. The biggest mindset shift required is realizing that LLMs are not replacements for rule-based systems. They are extraordinary navigators of state spaces and generators of structured outputs, but they are terrible at executing precise sequences of operations reliably. The FSM-LLM hybrid treats these as entirely separate concerns.
Why Pure LLM Agents Break
Imagine a customer service agent handling a multi-step return process. The customer asks about the return window, pivots to a technical question about the product, and then returns to the refund policy. A pure LLM agent will often lose track of where it is in the resolution process. It might provide inconsistent information, skip required verification steps, or get stuck in an infinite loop. It has no guaranteed termination condition.
This happens because the LLM has no structural model of the conversation. It is simply predicting the next token based on the context window. It does not maintain state in any rigorous, programmatic sense. Every response is a probabilistic generation with no enforced contract about what *must* happen, what *must not* happen, or when the interaction should conclude. This is perfectly fine for open-ended creative writing, but it is a disaster for structured business workflows where compliance, consistency, and completeness are non-negotiable.
The failure mode is particularly insidious because it's intermittent. The agent handles 95% of conversations flawlessly. The remaining 5% produce errors ranging from subtle inconsistencies to completely derailed, brand-damaging interactions. Engineering teams then spend weeks trying to prompt-engineer their way out of these edge cases, adding increasingly elaborate system prompts. This usually makes the problem worse by consuming the context window budget on structural instructions rather than task-relevant information.
The Hybrid Architecture
*Figure: FSM-LLM Hybrid Flow (deterministic state management controlling non-deterministic LLM calls)*
The FSM-LLM hybrid solves this by splitting responsibilities cleanly. The FSM manages the conversation state, defines valid transitions, and enforces termination conditions. The LLM generates natural language responses strictly within the bounds defined by each state.
Think of it this way: The FSM asks, *"What state are we in, and what transitions are valid?"* The LLM answers, *"Given this state and context, what should we say?"*
*Figure: Hybrid Interaction Flow (structural flow of data and control)*
The core design choice is the constraint schema: the LLM's output is structured and validated by the FSM before execution. The LLM cannot trigger invalid state transitions because the FSM rejects them and requests a new output. This validation step is the safety boundary that keeps the LLM from ever taking an invalid action.
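A minimal sketch of that boundary, using the conversational states discussed later in this article. The names (`AgentFSM`, `validate`, `apply`) and the exact state set are illustrative, not from a specific library:

```python
from dataclasses import dataclass

# Illustrative transition map: each state lists the states it may move to.
ALLOWED = {
    "greeting": {"discovery"},
    "discovery": {"objection_handling", "closing"},
    "objection_handling": {"closing", "discovery"},
    "closing": set(),  # terminal state
}

@dataclass
class AgentFSM:
    state: str = "greeting"

    def validate(self, proposed: str) -> bool:
        """Check a proposed transition against the allowed set for the current state."""
        return proposed in ALLOWED[self.state]

    def apply(self, proposed: str) -> None:
        if not self.validate(proposed):
            # Safety boundary: invalid LLM output never mutates state;
            # the caller re-prompts the LLM instead.
            raise ValueError(f"invalid transition {self.state} -> {proposed}")
        self.state = proposed
```

The LLM proposes, the FSM disposes: the model's structured output is just a candidate until `apply` accepts it.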
FSM Design for Conversational Agents
*Figure: Conversational Agent States (FSM transitions driving the interaction lifecycle)*
A conversational FSM defines the stages of interaction as an enumeration of valid states with explicit transitions between them. Each transition is triggered by an intent classification from the user's input.
The FSM guarantees the conversation follows a valid path. It cannot skip required steps or end up in undefined states. Every transition is explicit, auditable, and testable with conventional unit tests. When something goes wrong, the state history provides a complete trace of what happened and why.
State design requires judgment. States should represent meaningful checkpoints in the workflow, not individual micro-operations. If every decision requires a state transition, the FSM becomes unmanageable. The right granularity captures the major phases of the interaction while leaving room for the LLM to handle variation within each phase.
LLM Integration Within States
Each state has an associated prompt template that provides the LLM with context and constraints for generating a response. The template specifies the current goal, the tone, the available information, and the expected output format.
The LLM generates natural language within the bounds defined by the FSM state. It has creative freedom for expression, not for structure. That distinction matters. The LLM can choose how to phrase an objection response — it can decide whether to empathize first or present data first, whether to use an analogy or a direct comparison. What it cannot do is decide to skip the objection handling state and jump directly to closing. That structural decision belongs to the FSM.
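One way to sketch this per-state prompting: each state owns a template that fixes the goal and constraints, while the LLM fills in the phrasing. The template names, fields, and wording here are assumptions for illustration:

```python
# Hypothetical per-state prompt templates: structure is fixed, phrasing is not.
STATE_PROMPTS = {
    "objection_handling": (
        "Goal: address the customer's objection empathetically.\n"
        "Tone: calm, factual.\n"
        "Objection: {objection}\n"
        "Relevant facts: {facts}\n"
        "Respond in 2-3 sentences. Do not discuss closing terms."
    ),
    "closing": (
        "Goal: summarize agreed points and confirm next steps.\n"
        "Agreed points: {points}\n"
        "Respond in 2-3 sentences."
    ),
}

def render_prompt(state: str, **context: str) -> str:
    """Fill the current state's template with state-relevant context only."""
    return STATE_PROMPTS[state].format(**context)
```

Note what is absent: nothing in the template lets the model choose a different state. Structure lives in the FSM; the template only scopes the language task.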
Latency Optimization
*Figure: Dynamic Model Routing (routing prompts based on task complexity and cost)*
Real-time conversations cannot tolerate multi-second delays. The hybrid architecture enables latency optimization that pure LLM systems cannot achieve because the FSM provides information about complexity before the LLM runs.
State-based model selection routes simple states (greetings, confirmations) to a fast small language model with approximately 100 millisecond inference time. Complex states (objection handling, closing) route to a larger, more capable model with approximately 1 second inference. This routing often cuts average latency by roughly 70% without sacrificing quality on the interactions that matter most.
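The routing policy can be as simple as a set lookup, because the FSM already knows the state before any inference runs. A minimal sketch (model names are placeholders, not real model identifiers):

```python
# States the FSM knows to be simple get the cheap, ~100 ms model;
# everything else gets the larger, ~1 s model.
FAST_STATES = {"greeting", "confirmation"}

def select_model(state: str) -> str:
    """Pick a model tier from the current FSM state, before any LLM call."""
    return "small-fast-model" if state in FAST_STATES else "large-capable-model"
```

A pure LLM system cannot do this, because nothing upstream of the model call knows how hard the next turn will be.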
Semantic caching adds another optimization layer. Common queries (questions about business hours, return policies, standard pricing) hit the cache instantly. Only novel queries that differ substantially from cached entries require full inference. A similarity threshold of 0.95 provides cache hits for semantically identical queries phrased differently while ensuring genuinely new questions receive fresh responses.
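A semantic cache reduces to a nearest-neighbor check over stored query embeddings. The sketch below injects the embedding function (a real system would use an embedding model and a vector index; the class and method names are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Return a cached answer when a query embedding is within the
    similarity threshold of a stored entry; otherwise miss."""

    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed          # embedding function is injected
        self.threshold = threshold
        self.entries = []           # list of (embedding, answer) pairs

    def get(self, query: str):
        q = self.embed(query)
        for emb, answer in self.entries:
            if cosine(q, emb) >= self.threshold:
                return answer
        return None  # genuinely new question: fall through to full inference

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))
```

The linear scan is fine for a handful of canned answers; at scale you would swap in an approximate nearest-neighbor index without changing the interface.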
Cost Control
LLM API costs scale with token usage. The FSM architecture provides three structural advantages for cost control. First, the FSM handles logic in code rather than in prompts. State transitions happen deterministically — no tokens consumed. Second, context pruning becomes possible because the FSM tracks what information is relevant to each state. An objection handling state needs the objection details and recent conversation turns, not the entire history. A closing state needs identified interests and resolved objections, not the raw conversation transcript.
This context pruning typically reduces token usage by around 60% compared to passing the full conversation history at every turn. Over millions of interactions, the cost difference is substantial. Third, prompt templating with reusable templates reduces the fixed overhead per interaction. The template structure is defined once; only the variable context changes between invocations.
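Context pruning follows directly from the FSM tracking state: each state declares which context fields it needs, and everything else stays out of the prompt. A sketch, with field names matching the examples above:

```python
# Per-state context policy: which fields of the full context each state
# is allowed to see. States not listed receive everything (a safe default).
CONTEXT_POLICY = {
    "objection_handling": ["objection", "recent_turns"],
    "closing": ["interests", "resolved_objections"],
}

def prune_context(state: str, full_context: dict) -> dict:
    """Keep only the fields the current state actually needs."""
    keys = CONTEXT_POLICY.get(state, list(full_context))
    return {k: full_context[k] for k in keys if k in full_context}
```

Because the policy is declarative, token savings are auditable per state rather than hoped for per prompt.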
Fine-Tuning for Domain Performance
Generic LLMs lack domain expertise. Fine-tuning bridges this gap through three mechanisms. Supervised Fine-Tuning (SFT) trains the model on labeled input-output pairs from your specific domain, teaching it the vocabulary, conventions, and style expected in your context. Parameter-Efficient Fine-Tuning (PEFT) with LoRA reduces computational cost while preserving the model's base capabilities — you adapt the model rather than replacing it. Reinforcement Learning from Human Feedback (RLHF) aligns outputs with human preferences by training a reward model on ranked responses and fine-tuning the LLM to maximize that reward.
The result is an LLM that speaks your domain's language and follows your domain's conventions — a model that knows your product catalog, your pricing logic, your objection handling playbook, and your brand voice.
Good Fit vs. Overkill
The FSM-LLM hybrid excels when the workflow has defined stages and decision points, when inconsistent behavior has real costs, when conversations span multiple turns with state carryover, and when the domain requires specialized language that generic LLMs do not possess. It is overkill for single-turn interactions with no state to manage, creative generation tasks where structure constrains the output, and exploratory conversations with no defined path to follow.
The Complete Agent
By decoupling the structural flow from the language generation, the hybrid architecture plays to the strengths of both paradigms. The FSM handles the rigid structure, providing the determinism, testability, and auditability that traditional software engineering demands. The LLM handles the language, delivering the fluency, adaptability, and naturalness that users expect. Together, they create agents that are both highly predictable and remarkably human-sounding—which, in my experience, is the only durable foundation for autonomous systems operating in the real world.
In my experience, this architecture noticeably improves production reliability compared to pure-LLM agents. The reason is simple: you preserve the debuggability and predictability of traditional software while gaining the reasoning power of large language models. You can test state transitions with unit tests, trace failures through the state history, and measure coverage of your transition table. Pure LLM agents don't give you that, which is why they're fragile in production.
Implementation Guide
This section explains how to implement the FSM-LLM hybrid described in the article in concrete code: defining the FSM, wiring the LLM per state, using a model router, and managing prompts. The two reference implementations use **LangGraph and CrewAI**; the same logical FSM and prompts apply to both, with framework-specific integration.
Sources: Architecture decisions and mapping are from the architecture documentation. Historical context for LangGraph and CrewAI is from framework documentation.
1. Define the FSM (States and Transitions)
The article’s hybrid uses a single source of truth for the FSM: explicit states and allowed transitions. Both companion repos implement the same machine.
States: idle → planning → drafting → review → revise (optional loop) → done.
Transitions: Deterministic where possible; the LLM is used only for *content* decisions (e.g., “is revision needed?”), not for state-machine correctness. Allowed transitions:
| From | To |
|---|---|
| idle | planning |
| planning | drafting |
| drafting | review |
| review | revise or done |
| revise | review |
| done | (none) |
In code, define an enum (or string constants) and a transition map, e.g. in fsm_spec.py:
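A minimal sketch of such a spec, matching the transition table above (the `State` enum and `transition` helper are illustrative; the repos expose the same `ALLOWED_TRANSITIONS` name):

```python
from enum import Enum

class State(str, Enum):
    IDLE = "idle"
    PLANNING = "planning"
    DRAFTING = "drafting"
    REVIEW = "review"
    REVISE = "revise"
    DONE = "done"

# Single source of truth for the machine: every edge in the table above.
ALLOWED_TRANSITIONS = {
    State.IDLE: {State.PLANNING},
    State.PLANNING: {State.DRAFTING},
    State.DRAFTING: {State.REVIEW},
    State.REVIEW: {State.REVISE, State.DONE},
    State.REVISE: {State.REVIEW},
    State.DONE: set(),  # terminal
}

def transition(current: State, target: State) -> State:
    """Return the new state, or raise on any edge not in the allowed map."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"invalid transition: {current.value} -> {target.value}")
    return target
```

Both framework mappings below consume this one map, so a transition bug is fixed in exactly one place.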
Keep state payload minimal: e.g. current_state, plan, draft, review_notes, revision_count, and optionally a message list. The FSM drives *which* step runs; the payload holds inputs and outputs for each step.
2. LLM Integration Per State
Each state that produces or judges content calls the LLM through a single abstraction: a model router (see below). The FSM never asks the LLM “what state next?”—only “what content?” (e.g., plan text, draft text, “approved or not,” revision text).
- Planning: LLM generates a structured plan/outline from the brief; output is stored in state (e.g. `plan`).
- Drafting: LLM produces a draft from the plan (and brief); output is `draft`.
- Review: LLM evaluates the draft against the brief and returns review notes and a pass/revise decision; that decision drives the FSM transition (revise vs done).
- Revise: LLM revises the draft using the review notes; output updates `draft`; then transition back to review.
Use template-based prompts (Jinja2 or f-strings with clear placeholders) so the same logical prompts can be shared across LangGraph and CrewAI; only message formatting (system vs user) may differ per framework.
3. Model Router
Introduce one model router that selects provider and model from configuration (env vars or a small config file). No hardcoded secrets.
- Interface: e.g. `get_model_for_step(step_name: str)` returning the LLM runnable for that step.
- Policy: Planning and drafting can use a “fast” or default model; review (and optionally revise) can use a “review” or quality model if desired.
- Config: Prefer `MODEL_PROVIDER` (e.g. `openai`, `anthropic`) and `MODEL_NAME` (e.g. `gpt-4o-mini`, `claude-3-haiku-20240307`) plus the corresponding API key env vars.
Tests stay simple because you swap in a mock runnable instead of hitting the router with real keys.
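A sketch of such a router, reading the env vars named above. To stay framework-agnostic it returns a provider/model config dict rather than a live runnable, and the `REVIEW_MODEL` override env var is an assumption of this sketch, not a documented convention:

```python
import os

# Which config tier each step uses; steps not listed fall back to "default".
DEFAULT_POLICY = {
    "planning": "default",
    "drafting": "default",
    "review": "review",
    "revise": "review",
}

def get_model_for_step(step_name: str) -> dict:
    """Resolve provider and model for a step from environment variables.

    The caller turns this config into a framework runnable; tests can
    bypass the router entirely and inject a mock runnable.
    """
    tier = DEFAULT_POLICY.get(step_name, "default")
    # REVIEW_MODEL (hypothetical) lets the review tier use a stronger model.
    override = os.environ.get("REVIEW_MODEL") if tier == "review" else None
    return {
        "provider": os.environ.get("MODEL_PROVIDER", "openai"),
        "model": override or os.environ.get("MODEL_NAME", "gpt-4o-mini"),
    }
```

Keeping model selection behind one function means the latency policy from the article (fast model for simple steps, quality model for review) lives in config, not scattered through node code.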
4. Prompts
One template per step — planning, drafting, review, revise. Both repos use the same logical content so behavior lines up.
Example (planning): "Create a structured outline for this brief. Brief: {brief}. Output a concise, bulleted plan." Skip the "You are an expert X" boilerplate — the task is clear from context.
Templates live in files (prompts/planning.txt, drafting.txt, etc.) and load at runtime. Tune wording without touching code. Each step gets what it needs — brief, plan, draft, review_notes — pass them in where the template expects them.
5. LangGraph Mapping
LangGraph maps to the FSM like this:
- StateGraph: Build a `StateGraph` with a TypedDict state that includes `current_state`, `plan`, `draft`, `review_notes`, `revision_count`, and optionally a message list. Use reducers for list-like fields (e.g. `messages`) so appends do not overwrite.
- Nodes: One node per “doing” state: `planning_node`, `drafting_node`, `review_node`, `revise_node`. Idle can be implicit in the initial state; a `done` node can format output if desired.
- Edges: Straight edges `planning` → `drafting` → `review`. A conditional edge after `review`: if revision needed → `revise`, else → `END`. Then `revise` → `review` to form the loop. Use a cap (e.g. max revisions) so the graph always terminates.
- Persistence: For the CLI demo, `MemorySaver` is enough; for resumable runs, use `SqliteSaver` or `PostgresSaver` with a `thread_id`.
In each node that calls the LLM, obtain the model from the shared router (e.g. get_model_for_step("planning")) so step-specific model selection is centralized and mockable.
6. CrewAI Mapping
CrewAI maps similarly:
- Agents: One “Writer” agent is enough for a minimal FSM-LLM; optionally separate Planner, Drafter, Reviewer, Reviser agents for clarity.
- Tasks: One task per FSM step (Planning, Drafting, Review, Revise). Each task’s `context` depends on previous task outputs (e.g. Drafting uses the Planning output).
- Process: Use a sequential process so order matches the FSM: Plan → Draft → Review → Revise.
- Conditional loop: CrewAI’s sequential flow does not natively support cycles. V1 can use a single pass: Plan → Draft → Review → Revise → Done. For the full FSM loop (review → revise → review again), the Implementation Guide recommends implementing a Flow that runs the Crew in a loop with a condition (documented in the architecture notes as the production path).
State is implicit in the chain of task outputs; for article alignment, document that “after the Draft task, FSM state = drafting,” etc. Configure each agent’s LLM via the shared model router (e.g. get_llm_for_step("planning")) so the same routing and testing strategy applies.
7. Why LangGraph and CrewAI?
**LangGraph** was built specifically to support stateful, cyclic, multi-step LLM workflows. It provides a graph mental model with native conditional edges for loops (e.g., review → revise → review). This aligns directly with the FSM-LLM design: the states become nodes, and the review-revise loop is a first-class conditional edge.
**CrewAI** focuses on multi-agent collaboration with sequential or hierarchical execution. Its task-centric model maps naturally to FSM steps, where context flows from one task to the next. While its basic sequential process doesn't natively loop, CrewAI Flows provide the necessary state and control flow to implement the complete FSM architecture in production.
8. Checklist for Your Own Implementation
- FSM spec: Define states and `ALLOWED_TRANSITIONS` in one place (e.g. `fsm_spec.py`); both repos reference the same logic.
- State payload: Minimal TypedDict (or equivalent) with `current_state`, `plan`, `draft`, `review_notes`, `revision_count`, and any message list.
- Model router: Single function (e.g. `get_model_for_step`) reading provider/model from env (or config); use it in every node/task that calls the LLM.
- Prompts: Template files for planning, drafting, review, revise; same logical text across implementations.
- LangGraph: StateGraph + one node per doing state + conditional edge review → revise/done + revise → review; optional MemorySaver or SqliteSaver.
- CrewAI: Sequential tasks for Plan, Draft, Review, Revise; single pass for V1; document Flow for the full review–revise loop.
- Tests: Unit tests for FSM transitions and model router (mocked LLM); integration test for the full pipeline with mocked LLM calls.
- CLI: One entrypoint (e.g. `python -m fsm_llm.cli --brief "..."`) that runs the pipeline and prints or saves the final plan, draft, and status.
Reference Implementations
Two reference implementations demonstrate the FSM-LLM hybrid from the article. They share the same logical FSM (states and transitions), prompt semantics, and model-router abstraction, and differ only in framework: **LangGraph** (graph-based, with a native review–revise loop) and **CrewAI** (task-based, single-pass V1 with an optional Flow path for the full loop). Both are runnable locally, and their unit and integration tests mock the LLM, so the test suites make no live API calls.
Repos:
- fsm-llm-langgraph: https://github.com/mikehenken/fsm-llm-langgraph
- fsm-llm-crewai: https://github.com/mikehenken/fsm-llm-crewai
1. fsm-llm-langgraph
Purpose: Implements the FSM-LLM pipeline using LangGraph: a StateGraph with one node per FSM state and a conditional edge for the review–revise loop.
Structure:
Run CLI demo:
Run tests:
Tests use mocks for LLM calls; no live API keys required for the test suite.
2. fsm-llm-crewai
Purpose: Implements the same FSM-LLM pipeline using CrewAI: sequential tasks (Plan → Draft → Review → Revise) with a single Writer (or separate Planner/Drafter/Reviewer/Reviser) and shared prompts. V1 is a single pass; the full review–revise loop can be implemented later with a CrewAI Flow.
Structure:
Run CLI demo:
Run tests:
All tests use mocks for the LLM and Crew kickoff; no live API calls. If both companion repos are installed on the same machine, use a dedicated virtualenv for this repo (or run pytest from this repo’s directory) so fsm_llm resolves to the CrewAI package.
3. Where to Find What
| Item | fsm-llm-langgraph | fsm-llm-crewai |
|---|---|---|
| FSM class / spec | src/fsm_llm/fsm_spec.py | src/fsm_llm/fsm_spec.py |
| Model router | src/fsm_llm/model_router.py (get_model_for_step) | src/fsm_llm/model_router.py (get_llm_for_step) |
| Prompts | src/fsm_llm/prompts/*.txt | src/fsm_llm/prompts/*.txt |
| Graph / pipeline | src/fsm_llm/graph.py (build_graph) | src/fsm_llm/crew.py (build_fsm_crew, run_pipeline) |
| CLI | src/fsm_llm/cli.py | src/fsm_llm/cli.py |
| State schema | src/fsm_llm/state.py (GraphState) | Implicit in task context chain |
Clone either repo and run it — the code matches the patterns above.