Structured Output Generation

The biggest reliability challenge in production LLM applications isn’t getting the model to understand the task; it’s getting it to respond in a format your downstream code can actually parse. A JSON object with a trailing comma. Key names that differ from the ones your code expects. A response wrapped in markdown code fences that breaks your parser.

Structured output generation is about solving this at the infrastructure level, not hoping the model behaves.

The Problem with Unstructured Outputs

LLMs are probabilistic: every token is sampled from a probability distribution. Without constraints, the model can drift from the format you asked for:

Expected: {"name": "John", "age": 30, "city": "NYC"}

Actual might be:
{"name": "John", "age": "30", "city": "NYC"}         ← age as string, not int
{"name": "John", "age": 30, city: "NYC"}             ← missing quotes on key
{"Name": "John", "Age": 30, "City": "NYC"}           ← wrong capitalization
Here's the extracted data: {"name": "John"...}       ← preamble text
{"name": "John", "age": 30, "city": "NYC", "extra": "field"}  ← hallucinated field

For one-off queries this might be fine. For a production pipeline processing 10,000 requests/day, even a 1% malformed-output rate means 100 broken records every day.

Approach 1: JSON Mode (API-Level)

The simplest solution for JSON output. Most major providers offer a dedicated mode:

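# OpenAI JSON mode
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        # JSON mode errors out unless the word "JSON" appears somewhere in the messages
        {"role": "system", "content": "Extract order details as JSON."},
        {"role": "user", "content": "Customer John Smith ordered 3 blue t-shirts size L, order total $89.97, shipping to Chicago."}
    ]
)
# The output is syntactically valid JSON unless generation was cut off by the token limit
choice = response.choices[0]
if choice.finish_reason != "length":
    order = json.loads(choice.message.content)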

Limitation: JSON mode guarantees syntactic validity but not adherence to your specific schema. You still might get extra fields or wrong types.
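
To make the gap between “valid JSON” and “matches the schema” concrete, here is a minimal standard-library check run against the failure cases listed earlier. EXPECTED and schema_problems are hypothetical names for illustration. Only the second sample fails json.loads; the other three are syntactically valid, so JSON mode could emit them, yet each still breaks the expected shape.

```python
import json

# Hypothetical expected shape for the example object: key -> required Python type
EXPECTED = {"name": str, "age": int, "city": str}

def schema_problems(raw: str) -> list[str]:
    """Return every way `raw` deviates from EXPECTED; an empty list means it conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]  # the only failure JSON mode rules out
    if not isinstance(data, dict):
        return ["top level is not an object"]
    problems = []
    for key, expected_type in EXPECTED.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif isinstance(data[key], bool) or not isinstance(data[key], expected_type):
            # bool is a subclass of int in Python, so reject it explicitly
            problems.append(f"wrong type for {key}: {type(data[key]).__name__}")
    problems += [f"unexpected key: {key}" for key in data if key not in EXPECTED]
    return problems

samples = [
    '{"name": "John", "age": "30", "city": "NYC"}',
    '{"name": "John", "age": 30, city: "NYC"}',
    '{"Name": "John", "Age": 30, "City": "NYC"}',
    '{"name": "John", "age": 30, "city": "NYC", "extra": "field"}',
]
for raw in samples:
    print(schema_problems(raw))
```

A check like this (or a Pydantic model) is what you still need on top of JSON mode.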

Approach 2: Structured Outputs with Schema (Strongest Guarantee)

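OpenAI’s Structured Outputs (introduced in August 2024) let you supply a JSON Schema, and in strict mode the model’s output is constrained to match it. Strict mode accepts a subset of JSON Schema: every object must set "additionalProperties": false and list all of its properties in "required", so a field that may be absent is expressed as a union with null. Anthropic’s API reaches a similar result through tool definitions with an input schema (Approach 3).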

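from openai import OpenAI
import json

client = OpenAI()

# Strict mode requires "additionalProperties": False on every object and every
# property listed in "required"; fields that may be absent become nullable instead.
schema = {
    "type": "object",
    "properties": {
        "customer_name": {"type": "string"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "product": {"type": "string"},
                    "quantity": {"type": "integer"},
                    # null when the customer did not mention a size
                    "size": {
                        "anyOf": [
                            {"type": "string", "enum": ["XS", "S", "M", "L", "XL"]},
                            {"type": "null"}
                        ]
                    }
                },
                "required": ["product", "quantity", "size"],
                "additionalProperties": False
            }
        },
        "total": {"type": "number"},
        "shipping_city": {"type": ["string", "null"]}
    },
    "required": ["customer_name", "items", "total", "shipping_city"],
    "additionalProperties": False
}

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_schema", "json_schema": {"name": "order", "schema": schema, "strict": True}},
    messages=[{"role": "user", "content": "Customer John Smith ordered 3 blue t-shirts size L, total $89.97, shipping Chicago."}]
)

message = response.choices[0].message
if message.refusal:  # the model may decline instead of filling the schema
    raise RuntimeError(message.refusal)
order = json.loads(message.content)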

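With strict: True, OpenAI uses constrained decoding: at each generation step, tokens that could not lead to a schema-valid completion are masked out before sampling. The output is therefore guaranteed to parse and match the schema (barring a refusal or a response cut off by the token limit); this is an enforced guarantee, not best-effort prompting. It does not guarantee that the extracted values are correct.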
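
The masking idea can be shown with a toy sketch. This is an illustration of the technique, not OpenAI’s implementation: the nine-token vocabulary, the fixed TOY_SCORES standing in for a model, and the target schema {"age": <integer>} are all hypothetical. The toy model prefers the wrong tokens (unmasked greedy decoding would start with "30" in quotes), but masking invalid tokens to -inf forces a schema-valid result.

```python
import math
import re

# Toy vocabulary (hypothetical); real tokenizers have on the order of 100k tokens
VOCAB = ['{"age": ', '{"Age": ', '"30"', '30', '3', '0', '}', ' extra', '<eos>']

# Context-free toy "model" that prefers the wrong tokens:
# wrong key case, quoted number, trailing junk
TOY_SCORES = {'{"age": ': 1.0, '{"Age": ': 2.0, '"30"': 3.0, '30': 2.5, '3': 0.5,
              '0': 0.4, '}': 2.6, ' extra': 2.8, '<eos>': 0.1}

HEAD = '{"age": '

def is_valid_prefix(text: str) -> bool:
    """Can `text` still be extended into a document matching {"age": <integer>}?"""
    if len(text) <= len(HEAD):
        return HEAD.startswith(text)
    if not text.startswith(HEAD):
        return False
    rest = text[len(HEAD):]
    digits = rest[:-1] if rest.endswith("}") else rest
    return digits.isdigit()

def is_complete(text: str) -> bool:
    return re.fullmatch(r'\{"age": \d+\}', text) is not None

def toy_logits(text: str) -> list[float]:
    return [TOY_SCORES[tok] for tok in VOCAB]

def constrained_greedy_decode(logits_fn, max_steps: int = 20) -> str:
    text = ""
    for _ in range(max_steps):
        scores = logits_fn(text)
        # Mask: any token that would make the output unable to match the schema gets -inf;
        # <eos> is only allowed once the document is complete
        masked = [
            score if (is_complete(text) if tok == "<eos>" else is_valid_prefix(text + tok))
            else -math.inf
            for tok, score in zip(VOCAB, scores)
        ]
        best = max(range(len(VOCAB)), key=masked.__getitem__)
        if masked[best] == -math.inf:
            raise RuntimeError("no valid continuation")
        if VOCAB[best] == "<eos>":
            return text
        text += VOCAB[best]
    raise RuntimeError("max_steps reached before a complete document")

print(constrained_greedy_decode(toy_logits))  # {"age": 30}
```

Checking every token at every step is only for readability; production systems typically compile the schema into a state machine and precompute which vocabulary tokens are legal in each state.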

Approach 3: Function / Tool Calling

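Originally designed for tool use (letting models call APIs), function calling also works well for structured extraction. The model is asked to “call a function” and generates arguments for the function’s parameter schema. Setting "strict": True in the function definition applies the same constrained decoding as Structured Outputs; without it, matching the schema is best-effort.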

tools = [{
    "type": "function",
    "function": {
        "name": "extract_order",
        "description": "Extract order details from customer message",
        "strict": True,  # constrain the arguments to the schema, as in Approach 2
        "parameters": {
            "type": "object",
            "properties": {
                "customer_name": {"type": "string"},
                "total": {"type": "number"},
                "items": {
                    "type": "array",
                    "items": {"type": "string"}
                }
            },
            "required": ["customer_name", "total", "items"],
            "additionalProperties": False  # required when strict is True
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4o",
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "extract_order"}},
    messages=[{"role": "user", "content": "..."}]
)

# Extract the structured data
args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)

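This is also the usual way to get structured output from Anthropic’s Claude: define a tool whose input_schema is your JSON Schema and force it with tool_choice={"type": "tool", "name": "extract_order"}. The arguments arrive already parsed, as the input field of a tool_use content block.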
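
A minimal sketch of the same extraction with Anthropic’s Messages API, assuming the official anthropic Python SDK (pass an anthropic.Anthropic() instance as client). The model ID is an assumption, so substitute the one you use, and first_tool_input is a hypothetical helper. The check at the bottom uses a stand-in object instead of a live response, so it runs without an API key.

```python
from types import SimpleNamespace

ORDER_TOOL = {
    "name": "extract_order",
    "description": "Extract order details from customer message",
    "input_schema": {  # Anthropic's field name for the tool's JSON Schema
        "type": "object",
        "properties": {
            "customer_name": {"type": "string"},
            "total": {"type": "number"},
            "items": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["customer_name", "total", "items"],
    },
}

def first_tool_input(response, tool_name: str) -> dict:
    """Hypothetical helper: return the arguments of the first matching tool_use block."""
    for block in response.content:
        if block.type == "tool_use" and block.name == tool_name:
            return block.input  # already a dict, no json.loads needed
    raise ValueError(f"no {tool_name} tool call in response")

def extract_order(client, text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: use your own model ID
        max_tokens=1024,
        tools=[ORDER_TOOL],
        tool_choice={"type": "tool", "name": "extract_order"},  # force this tool
        messages=[{"role": "user", "content": text}],
    )
    return first_tool_input(response, "extract_order")

# Offline check with a stand-in response object (hypothetical data, no API call)
fake_response = SimpleNamespace(content=[
    SimpleNamespace(type="tool_use", name="extract_order",
                    input={"customer_name": "John Smith", "total": 89.97, "items": ["blue t-shirt"]}),
])
print(first_tool_input(fake_response, "extract_order")["customer_name"])  # John Smith
```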

Approach 4: Constrained Decoding Libraries

For local models (LLaMA, Mistral, etc.), grammar-based constrained decoding lets you specify exact output grammars at the inference level.

# Using the Outlines library (0.x API; Outlines 1.0 reorganized these entry points)
import outlines
from pydantic import BaseModel
from typing import List

class OrderItem(BaseModel):
    product: str
    quantity: int

class Order(BaseModel):
    customer_name: str
    items: List[OrderItem]
    total: float

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
generator = outlines.generate.json(model, Order)

order = generator(
    "John Smith ordered 3 blue t-shirts and 1 black jacket. Total: $134.97"
)
# Returns an Order instance: the structure is guaranteed by the grammar, the values still need checking

Libraries in this space:

Outlines: Grammar-constrained decoding, Pydantic integration
Guidance: Microsoft’s template-based structured generation
LMQL: Query language for constrained LLM output
llama.cpp grammar: GBNF grammar files for C++ inference

Generating SQL Reliably

SQL generation is a common use case with specific challenges: table and column names must match your schema, JOINs must be valid, and syntax varies by database.

Best practice pattern:

System prompt:
You generate valid PostgreSQL queries for our database.
Schema:
  orders(id, customer_id, total, created_at, status)
  customers(id, name, email, city, created_at)
  order_items(id, order_id, product_name, quantity, price)

Rules:
- Use only columns that exist in the schema above
- Always include a LIMIT unless explicitly asked for all records
- Use parameterized style: $1, $2 for user-provided values
- Return ONLY the SQL query, nothing else

User: Show me all orders over $100 from customers in Chicago this month
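
One way to keep the schema section of that prompt honest is to render it from the same data structure the rest of your code uses, so it cannot drift from what you validate against. A minimal sketch: SCHEMA, RULES, and build_sql_system_prompt are hypothetical names, and in a real deployment you might populate SCHEMA from Postgres’s information_schema.columns rather than hard-coding it.

```python
# Single source of truth for the tables the model may use (mirrors the prompt above)
SCHEMA = {
    "orders": ["id", "customer_id", "total", "created_at", "status"],
    "customers": ["id", "name", "email", "city", "created_at"],
    "order_items": ["id", "order_id", "product_name", "quantity", "price"],
}

RULES = [
    "Use only columns that exist in the schema above",
    "Always include a LIMIT unless explicitly asked for all records",
    "Use parameterized style: $1, $2 for user-provided values",
    "Return ONLY the SQL query, nothing else",
]

def build_sql_system_prompt(schema: dict[str, list[str]], dialect: str = "PostgreSQL") -> str:
    """Render the system prompt so the schema text always matches the schema dict."""
    lines = [f"You generate valid {dialect} queries for our database.", "Schema:"]
    lines += [f"  {table}({', '.join(cols)})" for table, cols in schema.items()]
    lines += ["", "Rules:"]
    lines += [f"- {rule}" for rule in RULES]
    return "\n".join(lines)

print(build_sql_system_prompt(SCHEMA))
```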

For higher reliability, pair this with a query validator that checks the generated SQL against your actual schema before execution.
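
A crude sketch of such a validator using only regular expressions. It is a heuristic under loose assumptions, not a SQL parser: it checks only qualified references like o.total, ignores unqualified columns, and can be fooled by string literals or EXTRACT(... FROM col). SCHEMA, KEYWORDS, and validate_sql are hypothetical names. A sturdier option is a real SQL parser such as sqlglot, or asking Postgres itself to PREPARE the statement against a schema-only database, which resolves tables and columns without executing anything.

```python
import re

# Tables and columns the generated SQL may reference (mirrors the prompt's schema)
SCHEMA = {
    "orders": {"id", "customer_id", "total", "created_at", "status"},
    "customers": {"id", "name", "email", "city", "created_at"},
    "order_items": {"id", "order_id", "product_name", "quantity", "price"},
}

# Words that can follow a table name but are not aliases
KEYWORDS = {"where", "join", "on", "inner", "left", "right", "full", "outer",
            "cross", "natural", "using", "group", "order", "having", "limit", "union"}

def validate_sql(sql: str, schema: dict[str, set[str]] = SCHEMA) -> list[str]:
    """Heuristic pre-execution checks; returns a list of problems (empty = none found)."""
    errors = []
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:
        errors.append("multiple statements")
    if not re.match(r"(?i)(select|with)\b", stmt):
        errors.append("not a SELECT query")
    if not re.search(r"(?i)\blimit\s+(\d+|\$\d+)", stmt):
        errors.append("missing LIMIT")

    # Map each table name and alias to its table: FROM orders o / JOIN customers AS c
    aliases = {}
    for table, alias in re.findall(r"(?i)\b(?:from|join)\s+(\w+)(?:\s+(?:as\s+)?(\w+))?", stmt):
        table = table.lower()
        if table not in schema:
            errors.append(f"unknown table: {table}")
            continue
        aliases[table] = table
        if alias and alias.lower() not in KEYWORDS:
            aliases[alias.lower()] = table

    # Check qualified column references such as o.total or customers.city
    for qualifier, column in re.findall(r"\b([A-Za-z_]\w*)\.(\w+)", stmt):
        table = aliases.get(qualifier.lower())
        if table is None:
            errors.append(f"unknown table or alias: {qualifier}")
        elif column.lower() not in schema[table]:
            errors.append(f"unknown column: {qualifier}.{column}")
    return errors

query = """SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.total > $1 AND c.city = $2
  AND o.created_at >= date_trunc('month', now())
LIMIT 100"""
print(validate_sql(query))  # []
```

Anything this returns should block execution and, ideally, be fed back to the model as a correction prompt.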

Output Parsing and Validation

Even with structured output APIs, defensive parsing is good practice:

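import json
import logging
from pydantic import BaseModel, ConfigDict, ValidationError

logger = logging.getLogger(__name__)

class ExtractedOrder(BaseModel):
    # Reject hallucinated extra fields instead of silently ignoring them (Pydantic's default)
    model_config = ConfigDict(extra="forbid")

    customer_name: str
    total: float
    items: list[str]

def parse_order_response(raw_json: str) -> ExtractedOrder | None:
    try:
        data = json.loads(raw_json)
        # model_validate also reports non-object payloads (e.g. a list) as a ValidationError
        return ExtractedOrder.model_validate(data)
    except json.JSONDecodeError as e:
        # Not JSON at all: log and retry with a stricter prompt
        logger.warning("Invalid JSON: %s", e)
        return None
    except ValidationError as e:
        # Valid JSON, wrong shape: log the mismatch details for prompt debugging
        logger.warning("Schema mismatch: %s", e)
        return None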
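
The “retry with a stricter prompt” step can be sketched as a small loop that feeds the parse error back to the model. This is one possible design, not a library API: generate stands in for whatever function calls your model, and generate_with_retry and parse_order are hypothetical names. Because json.JSONDecodeError and pydantic.ValidationError both subclass ValueError, a Pydantic-based parser such as ExtractedOrder.model_validate_json can be passed as parse unchanged.

```python
import json
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def generate_with_retry(generate: Callable[[str], str], parse: Callable[[str], T],
                        prompt: str, max_attempts: int = 3) -> Optional[T]:
    """Call the model, parse its reply, and re-prompt with the parse error on failure."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        try:
            return parse(raw)
        except ValueError as err:
            # json.JSONDecodeError and pydantic.ValidationError both subclass ValueError
            current_prompt = (
                f"{prompt}\n\nYour previous reply could not be parsed: {err}\n"
                "Reply with ONLY a JSON object that matches the schema."
            )
    return None  # caller decides: log, alert, or route to a fallback

def parse_order(raw: str) -> dict:
    """Stand-in parser using only the standard library."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("total"), (int, float)):
        raise ValueError("expected an object with a numeric 'total'")
    return data
```

Re-sending the original prompt plus the error, rather than the previous failed prompt, keeps the retry prompt from growing with each attempt.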

For high-volume production systems, track parse failure rates by prompt version. A rising failure rate signals that the model’s output distribution has shifted, which can happen when a provider updates the model behind an alias.
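
A minimal in-memory sketch of that tracking. ParseFailureTracker is a hypothetical class, and the regression rule (alert once at least 100 samples exist and the rate exceeds twice a known baseline) uses arbitrary illustrative thresholds. In production you would more likely export these counters to your existing metrics system and alert there.

```python
from collections import defaultdict

class ParseFailureTracker:
    """In-memory sketch: counts parse attempts and failures per prompt version."""

    def __init__(self) -> None:
        self.attempts: dict[str, int] = defaultdict(int)
        self.failures: dict[str, int] = defaultdict(int)

    def record(self, prompt_version: str, parsed_ok: bool) -> None:
        self.attempts[prompt_version] += 1
        if not parsed_ok:
            self.failures[prompt_version] += 1

    def failure_rate(self, prompt_version: str) -> float:
        n = self.attempts.get(prompt_version, 0)
        return self.failures.get(prompt_version, 0) / n if n else 0.0

    def is_regressing(self, prompt_version: str, baseline: float,
                      factor: float = 2.0, min_samples: int = 100) -> bool:
        """Flag a version whose failure rate exceeds factor x baseline, once enough samples exist."""
        if self.attempts.get(prompt_version, 0) < min_samples:
            return False
        return self.failure_rate(prompt_version) > factor * baseline
```

Requiring a minimum sample count avoids alerting on noise right after a new prompt version ships.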

Choosing the Right Approach

Scenario	Best Approach
OpenAI API (GPT-4o and later)	Structured Outputs with JSON Schema
Anthropic Claude API	Tool calling with schema
Local open-source model	Outlines constrained decoding
Quick prototype	JSON mode + Pydantic validation
SQL generation	Schema-in-prompt + query validator
Complex multi-field extraction	Function calling with detailed schema

The general principle: move validation as close to generation as possible. Post-hoc parsing is fragile; constrained decoding at the token level is robust.