Open Source · 2026-06-03 · 4 min read

Outlines: Stop Parsing LLM Output. Force the Model to Speak Your Schema at the Token Level.

Instructor and PydanticAI fix structured outputs by re-parsing whatever the model said and hoping for the best. Outlines takes a different bet: it constrains the token sampler itself, so the model physically cannot emit a byte that violates your JSON schema. That architectural difference is the most under-discussed idea in open-source LLM tooling right now.

Most "structured output" libraries for LLMs do the same thing: let the model generate whatever it wants, parse the result, validate it, retry on failure. Instructor wraps Pydantic around an OpenAI call. PydanticAI does the same with a different API. Both patch over a problem they didn't solve. The model is still free to hallucinate, truncate, or invent a field.

Outlines, the open-source library from the .txt team, takes a different bet. It compiles your JSON schema, regex, or Pydantic model into a finite state machine, hands that FSM to the token sampler, and forces every generated token to be a legal transition. The model cannot emit a closing brace before its corresponding field, cannot output malformed JSON, cannot invent a value outside your schema. This is not a wrapper. It's a constraint on generation itself.

Why Parsing Is The Wrong Layer

The standard pattern: call the LLM, get back a string that is allegedly JSON, run it through `json.loads`, catch the exception, append "respond with valid JSON," retry, hope. This works 90-95% of the time. The remaining 5-10% is a tax you pay forever in latency, cost, and 3am bugs. Instructor and PydanticAI push that to 1-2% with structured retries — better, but every retry is another full LLM call.
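For reference, the parse-and-retry pattern described above looks roughly like this. It is a minimal sketch: `fake_llm` is a hypothetical stand-in for a real model call, rigged to return malformed JSON on the first attempt so the retry path is visible.

```python
import json


def fake_llm(prompt: str, attempt: int) -> str:
    """Hypothetical stand-in for a model call: malformed first, valid second."""
    if attempt == 0:
        # Trailing comma: a typical "almost JSON" failure.
        return '{"vendor": "ACME Corp", "total": 1240.0,}'
    return '{"vendor": "ACME Corp", "total": 1240.0}'


def extract_with_retries(prompt: str, max_attempts: int = 3) -> tuple[dict, int]:
    """Return the parsed object and the number of model calls it took."""
    for attempt in range(max_attempts):
        raw = fake_llm(prompt, attempt)
        try:
            return json.loads(raw), attempt + 1
        except json.JSONDecodeError:
            # Every failure costs another full model call.
            prompt += "\nRespond with valid JSON."
    raise ValueError(f"no valid JSON after {max_attempts} attempts")


result, calls = extract_with_retries("Extract the invoice fields.")
```

Here the second call succeeds, so `calls` is 2. In production the number of extra calls is unpredictable, which is the tax.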

The deeper problem: prompt-level instructions are an honor system. The model reads them, does its best, and gets it wrong in subtle ways — trailing commas, unescaped quotes, wrong types for nested optionals. The reliability ceiling never reaches 100%. The constraint belongs on the sampler, not on the prompt.

How It Works

Outlines builds an index over your tokenizer's vocabulary: for each FSM state, the set of tokens that are legal from that state and the state each one leads to. When the model is about to sample the next token, Outlines masks the logits so that any token without a valid transition is set to negative infinity. The sampler has no choice but to pick a legal token, so a schema-violating token can never be emitted.
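The masking idea fits in a few lines of plain Python. This is a toy sketch, not Outlines' actual implementation: the vocabulary, the logits, and every function name here are made up for illustration, and the "schema" is just the three currency codes from the invoice example below.

```python
import math

ALLOWED = ["USD", "EUR", "GBP"]            # the "schema": one of three strings
VOCAB = ["US", "D", "EU", "R", "GB", "P", "USD", "JPY", "{", "<eos>"]
EOS = "<eos>"

# FSM states are the prefixes of the allowed strings (a trie is a small FSM).
STATES = {word[:i] for word in ALLOWED for i in range(len(word) + 1)}


def step(state, token):
    """Advance the FSM by one token; return None if the token is illegal."""
    if token == EOS:
        return state if state in ALLOWED else None  # may only stop when complete
    nxt = state + token
    return nxt if nxt in STATES else None


def build_index():
    """Precompute state -> {token_id: next_state} for every legal token."""
    return {
        s: {i: step(s, t) for i, t in enumerate(VOCAB) if step(s, t) is not None}
        for s in STATES
    }


def mask_logits(logits, state, index):
    """Set every token without a valid transition to negative infinity."""
    return [x if i in index[state] else -math.inf for i, x in enumerate(logits)]


def greedy_decode(logits_fn, index):
    """Pick the highest-scoring legal token until the FSM allows stopping."""
    state, out = "", []
    while True:
        masked = mask_logits(logits_fn(state), state, index)
        tok_id = max(range(len(VOCAB)), key=masked.__getitem__)
        if VOCAB[tok_id] == EOS:
            return "".join(out)
        out.append(VOCAB[tok_id])
        state = index[state][tok_id]


# A "model" that badly wants to say "JPY" or "{", neither of which is legal.
LOGITS = [0.1, 0.0, 0.5, 0.2, 0.3, 0.0, 0.2, 9.0, 5.0, 1.0]
index = build_index()
decoded = greedy_decode(lambda state: LOGITS, index)
```

The unconstrained favorite is `JPY`, but it is masked out at the start state, so the decoder picks the best legal token (`EU`), is then forced to `R`, and can only stop once the string is complete. The index is built once and consulted with a dictionary lookup per step, which is why the per-token cost stays small.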

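import outlines
from typing import Literal
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes the Outlines 1.x API, which wraps an already-loaded model and tokenizer.
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained(MODEL_NAME).to("cuda"),
    AutoTokenizer.from_pretrained(MODEL_NAME),
)

class Invoice(BaseModel):
    vendor: str
    total: float
    currency: Literal["USD", "EUR", "GBP"]

# The call returns a schema-conforming JSON string; Pydantic turns it into an object.
raw = model("Extract: ACME Corp invoice #4421 for $1,240 USD.", Invoice, max_new_tokens=200)
invoice = Invoice.model_validate_json(raw)
# Expected: invoice.vendor == "ACME Corp", invoice.total == 1240.0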

No retry loop. No `try/except` around `json.loads`. The output is the schema. If the model is uncertain about a field, it can't bluff — it has to commit to a legal value or emit a stop token.

The vLLM Integration Compounds It

vLLM is a widely used production inference server for local models, and Outlines is one of its guided-decoding backends. Compile the FSM once at startup and reuse it across every request. The marginal cost in steady state is then essentially zero: a fast logit mask per token. That means thousands of constrained requests per second per GPU, with no parsing-and-retry tax. This is the pattern that lets you treat structured generation as the default rather than a special case.

What It's For

The sweet spot is high-volume extraction pipelines with stable schemas — invoice parsing, contract clause extraction, log normalization, taxonomy tagging. Anywhere you would have written "ask the LLM, parse the JSON, retry on failure," write "ask the LLM with a Pydantic model and trust the output."

The second sweet spot is agent tool calling. The OpenAI function-calling format is a JSON schema, and Outlines guarantees the model emits exactly that schema.
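To make that concrete: in the function-calling format, the `parameters` field of a tool definition is itself a JSON schema, and that is the piece you would hand to constrained generation. The tool below (`get_weather`) is a hypothetical example, and `conforms` is a deliberately tiny checker for flat string schemas, included only to show what "emits exactly that schema" rules out.

```python
# Hypothetical tool definition in the OpenAI function-calling format.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city", "unit"],
            "additionalProperties": False,
        },
    },
}


def arguments_schema(tool: dict) -> dict:
    """The JSON schema that the tool-call arguments must satisfy."""
    return tool["function"]["parameters"]


def conforms(arguments: dict, schema: dict) -> bool:
    """Minimal check for flat schemas: required keys, no extras, strings, enums."""
    props = schema["properties"]
    if set(schema["required"]) - set(arguments):
        return False  # a required field is missing
    if not schema.get("additionalProperties", True) and set(arguments) - set(props):
        return False  # the model invented a field
    for key, value in arguments.items():
        spec = props.get(key, {})
        if spec.get("type") == "string" and not isinstance(value, str):
            return False
        if "enum" in spec and value not in spec["enum"]:
            return False
    return True


schema = arguments_schema(GET_WEATHER_TOOL)
ok = conforms({"city": "Oslo", "unit": "celsius"}, schema)
bad_enum = conforms({"city": "Oslo", "unit": "kelvin"}, schema)
extra_key = conforms({"city": "Oslo", "unit": "celsius", "days": 3}, schema)
```

With parse-and-validate, the `kelvin` and `days` cases are failures you detect after the fact. With a sampler-level constraint on the same schema, they are outputs the model cannot produce in the first place.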

The third is local inference. Open-source models drift in structured output, especially at smaller sizes. On schema validity, a 7B model constrained by Outlines beats a 70B model that relies on prompt instructions alone.

The Honest Limitations

The first call compiles the grammar, which is slow for complex schemas: hundreds of fields can take seconds. Cache the grammar at startup. If you generate thousands of schemas dynamically, that compile time becomes a real, recurring cost.

The constraint only decides which tokens are legal at each sampling step. It cannot make the model know the right answer. If the model is confidently wrong about a fact (hallucinating a name, fabricating a citation), Outlines won't save you. It guarantees format, not truth.

The OpenAI path is weaker than the local path because OpenAI doesn't expose raw logits. Outlines uses API-level mechanisms with FSM-based post-filtering as fallback — good, but not the vLLM guarantee. Documentation is also uneven. You'll read the source.

The Take

Outlines is the most important open-source LLM tooling release of the last two years that you've probably under-weighted. The other libraries are parsing; Outlines is constraining. The difference is a failure rate that is, in practice, indistinguishable from zero on the local-inference path.

The architectural bet is right. The token sampler is the right layer. The throughput story is right. If you are building extraction pipelines, agent tool calling, or any system where JSON-schema reliability is a production requirement, use Outlines. If you are using Instructor or PydanticAI to glue retries around a model that is still free to break your schema, you are solving yesterday's problem. Install it.

*Outlines is open source at [github.com/dottxt-ai/outlines](https://github.com/dottxt-ai/outlines) — formerly outlines-dev. Apache 2.0. Built by the team at [.txt](https://dottxt.co). Integrations: Hugging Face Transformers, vLLM, llama.cpp, OpenAI API. Pydantic-native. Used in production by NVIDIA, Cohere, Hugging Face, and vLLM. 8K+ stars, active 2026 development.*