LLMOps

Moving an LLM application from prototype to production is harder than it looks. The model might work great in demos but drift in quality over time. Costs balloon unexpectedly. Latency spikes under load. Users are unhappy but you don’t know why.

LLMOps is the discipline of operating LLM-based systems reliably, efficiently, and at scale — borrowing from MLOps and DevOps but adapted for the unique challenges of generative AI.

How LLMOps Differs from Traditional MLOps

Traditional ML models:

Deterministic outputs (given same input, same output)
Performance degrades predictably with data drift
Well-established metrics (accuracy, F1, AUC)
Models are static artifacts between retraining

LLM applications:

Non-deterministic outputs by design
“Correctness” is subjective and context-dependent
Third-party models change underneath you (providers such as OpenAI and Anthropic update and retire hosted models)
Prompt changes are deployable changes — no retraining required
Costs are variable based on token count, not just request count

The LLMOps Lifecycle

┌────────────────────────────────────────────────────────────────┐
│                     LLMOps Lifecycle                           │
├──────────────┬──────────────┬──────────────┬──────────────────┤
│  DEVELOP     │  EVALUATE    │  DEPLOY      │  MONITOR         │
│              │              │              │                  │
│ Prompt dev   │ Eval dataset │ Version mgmt │ Cost tracking    │
│ RAG tuning   │ LLM judging  │ Canary/A/B   │ Latency tracking │
│ Tool design  │ Human review │ Rollback     │ Quality metrics  │
│ Fine-tuning  │ Regression   │ Cache layer  │ Error rates      │
└──────────────┴──────────────┴──────────────┴──────────────────┘

Prompt Versioning and Management

Prompts are code. Treat them that way.

version: "2.3.1"
model: "claude-3-5-sonnet-20241022"
created: "2025-11-15"
author: "team-ai"
changelog: "Added explicit refund policy handling, improved escalation clarity"

system: |
  You are a customer support assistant for Acme Corp...

parameters:
  temperature: 0.3
  max_tokens: 512

eval_metrics:
  last_run: "2025-11-14"
  faithfulness: 0.92
  user_satisfaction: 4.3/5

Store prompt versions in git. Roll back on quality regression. Tag versions that are deployed to production.
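
As an illustration of the deploy-and-roll-back workflow, here is a minimal sketch of a prompt registry. The `PromptVersion` and `PromptRegistry` names are hypothetical, and the in-memory storage stands in for files tracked in git; it is not any particular tool's API.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    version: str            # semantic version string, e.g. "2.3.1"
    model: str
    system: str
    temperature: float = 0.3
    max_tokens: int = 512

    @property
    def content_hash(self) -> str:
        """Short fingerprint of the prompt text, handy for logs and traces."""
        return hashlib.sha256(self.system.encode()).hexdigest()[:8]


class PromptRegistry:
    """Hypothetical in-memory registry; real versions would live in git."""

    def __init__(self) -> None:
        self._versions = {}   # version string -> PromptVersion
        self._deployed = []   # deployment history, newest last

    def register(self, prompt: PromptVersion) -> None:
        self._versions[prompt.version] = prompt

    def deploy(self, version: str) -> None:
        if version not in self._versions:
            raise KeyError(f"unknown prompt version: {version}")
        self._deployed.append(version)

    def current(self) -> PromptVersion:
        if not self._deployed:
            raise RuntimeError("nothing deployed")
        return self._versions[self._deployed[-1]]

    def rollback(self) -> PromptVersion:
        """Drop the current deployment and return to the previous one."""
        if len(self._deployed) < 2:
            raise RuntimeError("no earlier deployment to roll back to")
        self._deployed.pop()
        return self.current()
```

Keeping the deployment history separate from the set of known versions means a rollback is a pointer move, not an edit to the prompt itself.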

Observability: What to Instrument

Every LLM call should produce a trace. At minimum, capture:

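# Structured LLM call logging
import hashlib
import uuid
from datetime import datetime, timezone

# Assumes `response`, `system_prompt`, `user_message`, and the `start_time` /
# `end_time` timestamps (in seconds) come from the surrounding request handler.
trace = {
    "trace_id": str(uuid.uuid4()),
    "timestamp": datetime.now(timezone.utc).isoformat(),  # utcnow() is deprecated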
    "model": "claude-3-5-sonnet-20241022",
    "prompt_version": "2.3.1",
    "environment": "production",

    # Input
    "system_prompt_hash": hashlib.sha256(system_prompt.encode()).hexdigest()[:8],
    "input_tokens": response.usage.input_tokens,
    "user_message_length": len(user_message),

    # Output
    "output_tokens": response.usage.output_tokens,
    "latency_ms": (end_time - start_time) * 1000,
    "finish_reason": response.stop_reason,

    # Cost
    "cost_usd": calculate_cost(response.usage.input_tokens,
                               response.usage.output_tokens,
                               model="claude-3-5-sonnet"),

    # Quality signals
    "user_feedback": None,  # populated later from thumbs up/down
    "automated_eval_score": None  # populated by async eval job
}
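
The trace above calls a `calculate_cost` helper that is not shown. A minimal sketch might look like the following; the price table is an assumption based on the per-million-token prices quoted in the model routing example later in this article, and the short model keys are illustrative. Check your provider's current pricing before relying on it.

```python
# Assumed list prices in USD per 1M tokens as (input, output).
PRICES_PER_MILLION = {
    "claude-3-haiku": (0.25, 1.25),
    "claude-3-5-sonnet": (3.00, 15.00),
}


def calculate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Return the estimated cost of a single call in USD."""
    try:
        input_price, output_price = PRICES_PER_MILLION[model]
    except KeyError:
        raise ValueError(f"no price configured for model {model!r}") from None
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

Raising on an unknown model, instead of silently returning zero, keeps a new model from going untracked in cost reports.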

Cost Monitoring and Optimization

LLM costs can surprise you. A system that works fine at 100 users/day can break the budget at 10,000 users/day.
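
To make that jump concrete, here is a rough projection. Every number in it is an assumption chosen for illustration: 5 requests per user per day, 1,500 input tokens and 400 output tokens per request, and prices of $3 and $15 per 1M input and output tokens.

```python
def cost_per_request_usd(input_tokens: int, output_tokens: int,
                         input_price_per_m: float = 3.0,
                         output_price_per_m: float = 15.0) -> float:
    """Cost of one request given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def daily_cost_usd(users_per_day: int, requests_per_user: int = 5,
                   input_tokens: int = 1500, output_tokens: int = 400) -> float:
    """Projected daily spend under the assumed usage profile."""
    requests = users_per_day * requests_per_user
    return requests * cost_per_request_usd(input_tokens, output_tokens)


small = daily_cost_usd(100)      # roughly $5 per day
large = daily_cost_usd(10_000)   # roughly $525 per day, about $15,750 over 30 days
```

Under these assumptions the cost per request never changes; the bill grows 100× purely because volume does, which is easy to miss while testing with a handful of users.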

Track Costs Per Feature, Not Just Total

# Tag every LLM call with its feature
trace["feature"] = "document_summarization"
# vs. "chat_response" vs. "code_review"

# Then aggregate:
daily_cost_by_feature = db.query("""
    SELECT feature,
           SUM(cost_usd) as total_cost,
           COUNT(*) as request_count,
           AVG(cost_usd) as avg_cost_per_request
    FROM llm_traces
    WHERE date = CURRENT_DATE
    GROUP BY feature
    ORDER BY total_cost DESC
""")
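
The `db` object above is left undefined. As a self-contained sketch, the same query can be run against an in-memory SQLite database; the table schema and the sample rows here are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llm_traces (trace_id TEXT, feature TEXT, cost_usd REAL, date TEXT)"
)

# Illustrative rows; real rows would be written by the trace logger.
sample_rows = [
    ("t1", "document_summarization", 0.020),
    ("t2", "document_summarization", 0.030),
    ("t3", "chat_response", 0.004),
]
conn.executemany(
    "INSERT INTO llm_traces VALUES (?, ?, ?, CURRENT_DATE)", sample_rows
)

daily_cost_by_feature = conn.execute(
    """
    SELECT feature,
           SUM(cost_usd) AS total_cost,
           COUNT(*) AS request_count,
           AVG(cost_usd) AS avg_cost_per_request
    FROM llm_traces
    WHERE date = CURRENT_DATE
    GROUP BY feature
    ORDER BY total_cost DESC
    """
).fetchall()

for feature, total, count, avg in daily_cost_by_feature:
    print(f"{feature}: ${total:.3f} over {count} requests (${avg:.4f} each)")
```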

Optimization Techniques

Prompt compression: Tools like LLMLingua can compress prompts by 2–4× with minimal quality loss. At roughly 3.3× compression, for example, a 2,000-token prompt shrinks to about 600 tokens.

Caching:

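import hashlib
import redis

cache = redis.Redis()

def cached_llm_call(prompt: str, ttl_seconds: int = 3600) -> str:
    # Key on the exact prompt text. If the model or sampling parameters can
    # vary between calls, include them in the hashed string as well.
    cache_key = hashlib.sha256(prompt.encode()).hexdigest()
    cached = cache.get(cache_key)
    if cached is not None:  # an empty cached response still counts as a hit
        return cached.decode()

    response = llm.generate(prompt)  # `llm` is a placeholder for your client
    cache.setex(cache_key, ttl_seconds, response)
    return response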

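Effective for identical prompts (FAQ answers, repeated system prompts with the same documents). Because the key is a hash of the exact text, prompts that differ by even one character miss the cache; catching near-identical prompts requires a semantic cache that matches on embedding similarity.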
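
A cheap way to raise the hit rate of an exact-match cache is to normalize the prompt and to key on everything that can change the response. The sketch below is one possible approach, not a standard API: `make_cache_key` and `InMemoryLLMCache` are hypothetical names, and a plain dict stands in for Redis so the example runs on its own.

```python
import hashlib
import json


def make_cache_key(prompt: str, model: str, temperature: float, max_tokens: int) -> str:
    """Hash the normalized prompt together with the settings that affect output."""
    # Collapse runs of whitespace so trivially different prompts share an entry.
    # Skip this step if whitespace is meaningful (for example, code prompts).
    normalized = " ".join(prompt.split())
    payload = json.dumps(
        {"prompt": normalized, "model": model,
         "temperature": temperature, "max_tokens": max_tokens},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


class InMemoryLLMCache:
    def __init__(self, generate):
        # `generate` is any callable (prompt, model, temperature, max_tokens) -> str
        self._generate = generate
        self._store = {}
        self.hits = 0
        self.misses = 0

    def call(self, prompt: str, model: str,
             temperature: float = 0.0, max_tokens: int = 512) -> str:
        key = make_cache_key(prompt, model, temperature, max_tokens)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._generate(prompt, model, temperature, max_tokens)
        self._store[key] = response
        return response
```

Tracking hits and misses alongside the cache makes it easy to tell whether the cache is actually saving calls.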

Model routing: Use a cheap model for simple tasks and a more expensive model for hard ones.

def route_to_model(query: str, complexity_threshold: float = 0.7) -> str:
    complexity = estimate_complexity(query)  # your classifier

    if complexity < complexity_threshold:
        return "claude-3-haiku-20240307"    # $0.25/$1.25 per 1M tokens
    else:
        return "claude-3-5-sonnet-20241022"  # $3/$15 per 1M tokens
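
The router depends on `estimate_complexity`, which is left to the reader. As a placeholder until a trained classifier is available, a keyword-and-length heuristic could look like this. The hint words, weights, and the 200-word scale are all arbitrary assumptions that would need tuning on real traffic.

```python
import re

# Words that loosely suggest multi-step reasoning (an assumed, untuned list).
REASONING_HINTS = ("why", "explain", "compare", "analyze", "debug", "prove", "step by step")
CODE_MARKERS = ("def ", "class ", "Traceback")


def estimate_complexity(query: str) -> float:
    """Hypothetical heuristic score in [0, 1]; higher means harder."""
    text = query.lower()
    words = re.findall(r"\w+", text)

    length_score = min(len(words) / 200, 1.0)  # long queries score higher
    hint_score = min(sum(hint in text for hint in REASONING_HINTS) / 3, 1.0)
    code_score = 1.0 if any(marker in query for marker in CODE_MARKERS) else 0.0

    # The weights sum to 1, so the result stays within [0, 1].
    return 0.2 * length_score + 0.5 * hint_score + 0.3 * code_score
```

Logging the score next to the routing decision makes it possible to check later whether queries sent to the cheap model were in fact answered well.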

Output length control: Explicitly limit response length. Every unnecessary output token costs money.

Deployment Strategies

Blue-Green Deployment

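Run the old version (blue) and the new version (green) side by side. Classic blue-green switches all traffic in one step and keeps the old version on standby for rollback; the staged schedule below is a more cautious variant that shifts traffic gradually.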

Week 1: 100% → v1.0 (old prompt)
Week 2:  90% → v1.0, 10% → v1.1 (new prompt, monitor)
Week 3:  50% → v1.0, 50% → v1.1 (if metrics good)
Week 4:   0% → v1.0, 100% → v1.1 (full rollout)
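
One way to implement a gradual shift like this is to hash each user into a stable bucket and compare it with the current rollout percentage. The sketch below assumes user-level assignment and uses a hypothetical salt string; the version labels mirror the schedule above.

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "prompt-v1.1") -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def prompt_version_for(user_id: str, rollout_percent: int) -> str:
    """Return "v1.1" for users inside the rollout and "v1.0" for everyone else."""
    return "v1.1" if rollout_bucket(user_id) < rollout_percent else "v1.0"
```

Because each user's bucket never changes, anyone who received v1.1 at 10% keeps receiving it at 50%, so raising the percentage only ever adds users to the new version.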

A/B Testing

Different user segments see different model versions. Measure downstream outcomes (conversion rate, session length, support escalation rate).

def get_model_version(user_id: str) -> str:
    # Deterministic assignment based on user ID
    if int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100 < 10:
        return "experiment"  # 10% bucket
    return "control"
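
Once both buckets have collected outcomes, you still need to decide whether a difference is real or noise. For a binary outcome such as conversion, a two-proportion z-test is one common choice; the sketch below uses only the standard library and assumes samples large enough for the normal approximation to hold.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple:
    """Compare conversion rates of control (a) and experiment (b).

    Returns (z, p_value), where p_value is two-sided.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one observation")

    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # no variation at all, so nothing to detect

    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value
```

With a 10% experiment bucket, the smaller group dominates the standard error, so expect to wait longer for a conclusive result than with an even split.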

Canary Releases

Route a small percentage of traffic to the new version. Watch for error rates, latency, and quality degradation before full rollout.
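
The "watch before full rollout" step can be turned into an explicit gate. The following is a sketch under assumed thresholds (minimum sample size, allowed error-rate increase, latency ratio, and quality drop); the `VersionStats` fields are hypothetical and would be filled from your trace store.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    requests: int
    errors: int
    p95_latency_ms: float
    avg_quality_score: float  # e.g. automated eval score, higher is better


def canary_decision(baseline: VersionStats, canary: VersionStats,
                    min_requests: int = 500,
                    max_error_rate_increase: float = 0.01,
                    max_latency_ratio: float = 1.2,
                    max_quality_drop: float = 0.03) -> str:
    """Return "wait", "rollback", or "promote" for the canary version."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge yet

    baseline_error_rate = baseline.errors / max(baseline.requests, 1)
    canary_error_rate = canary.errors / canary.requests

    if canary_error_rate - baseline_error_rate > max_error_rate_increase:
        return "rollback"
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"
    if baseline.avg_quality_score - canary.avg_quality_score > max_quality_drop:
        return "rollback"
    return "promote"
```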

LLMOps Tools (2025–2026)

Category	Tools
Observability	Langfuse, Helicone, Braintrust, Arize Phoenix
Prompt management	Langfuse, Promptflow, PromptLayer
Evaluation	Braintrust, DeepEval, RAGAS, TruLens
Model serving	vLLM, Ollama, AWS Bedrock, Vertex AI
Caching	GPTCache (semantic cache, from Zilliz), Redis
Cost tracking	Helicone, OpenMeter, custom via Langfuse
Gateways	LiteLLM (multi-provider proxy), Kong AI Gateway

Production Readiness Checklist

Before going to production with an LLM application:

Observability

Every LLM call is logged with trace ID, model, token counts, latency, cost
Error rates dashboarded and alerted
Cost per request tracked by feature

Quality

Evaluation dataset exists (at least 100 examples)
Automated eval runs on every prompt change
Quality metrics are baselined and regression alerting is set up
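
The regression check in the last item could run in CI after each eval run. This is a minimal sketch that assumes every metric is "higher is better" and uses an arbitrary tolerance of 0.02; the metric names are examples only.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return {metric: drop} for metrics that fell more than `tolerance`.

    A metric missing from `candidate` is treated as a score of 0.0,
    so a skipped eval shows up as a regression, not a silent pass.
    """
    regressions = {}
    for metric, baseline_score in baseline.items():
        drop = baseline_score - candidate.get(metric, 0.0)
        if drop > tolerance:
            regressions[metric] = drop
    return regressions


baseline_scores = {"faithfulness": 0.92, "answer_relevance": 0.88}
candidate_scores = {"faithfulness": 0.85, "answer_relevance": 0.89}

failed = find_regressions(baseline_scores, candidate_scores)
if failed:
    print("Blocking deploy, regressed metrics:", sorted(failed))
```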

Safety

Input validation / injection detection
Output content safety checks
Rate limiting per user
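
For per-user rate limiting, a token bucket is a common choice because it allows short bursts while capping the sustained rate. The sketch below keeps state in process memory, which is an assumption that only suits a single instance; a multi-instance deployment would need shared storage. The clock is injectable so the behavior can be tested without sleeping.

```python
import time


class TokenBucket:
    def __init__(self, capacity: int, refill_per_second: float, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._now = now
        self._tokens = float(capacity)  # start full so a burst is allowed
        self._last = now()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        current = self._now()
        elapsed = current - self._last
        self._last = current
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class PerUserRateLimiter:
    def __init__(self, capacity: int = 5, refill_per_second: float = 1.0,
                 now=time.monotonic):
        self._capacity = capacity
        self._refill = refill_per_second
        self._now = now
        self._buckets = {}  # user_id -> TokenBucket

    def allow(self, user_id: str) -> bool:
        bucket = self._buckets.get(user_id)
        if bucket is None:
            bucket = TokenBucket(self._capacity, self._refill, self._now)
            self._buckets[user_id] = bucket
        return bucket.allow()
```

Since LLM cost scales with tokens, a variant that deducts tokens in proportion to request size may fit better than one token per request.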

Reliability

Retry logic with exponential backoff
Timeout handling (LLM calls can hang)
Fallback responses for model outages
Circuit breaker for repeated failures
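
These four items tend to be implemented together. Here is a compact sketch that combines retries with exponential backoff, a simple circuit breaker, and a fallback response. It assumes `call_llm` is your own function and that it enforces its own timeout by raising an exception; the thresholds and the fallback text are placeholders.

```python
import random
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_seconds: float = 30.0,
                 now=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self._now = now
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open state).
        return self._now() - self._opened_at >= self.reset_after_seconds

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._now()


def call_with_resilience(call_llm, prompt: str, breaker: CircuitBreaker,
                         fallback: str = "Sorry, I can't answer right now.",
                         max_attempts: int = 3, base_delay: float = 0.5,
                         sleep=time.sleep) -> str:
    if not breaker.allow_request():
        return fallback  # circuit open: fail fast without calling the model

    for attempt in range(max_attempts):
        try:
            result = call_llm(prompt)
        except Exception:  # in real code, catch only transient provider errors
            breaker.record_failure()
            if attempt == max_attempts - 1 or not breaker.allow_request():
                break
            # Exponential backoff with jitter: about 0.5s, 1s, 2s, ...
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
        else:
            breaker.record_success()
            return result
    return fallback
```

Sharing one breaker per provider or model, not one per request, is what lets repeated failures across requests trip the circuit.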

Operations

Prompts are versioned and deployable independently
Model version is pinned (don’t auto-upgrade)
On-call runbook for LLM failures

The Ongoing Operations Reality

Unlike traditional software, LLM applications require ongoing attention:

Model updates: Your provider updates their model (GPT-4 → GPT-4-turbo → GPT-4o). Quality may change. Test before relying on the new version.
Prompt drift: A prompt that works well today may degrade as the underlying model is updated. Monitor quality metrics continuously.
Cost inflation: As your product grows, LLM costs grow at least in proportion to usage. Build cost optimization into your roadmap rather than treating it as a crisis response.
Adversarial users: As your product scales, users will probe for safety gaps. Your guardrails and monitoring need to catch novel attacks, not just the ones you anticipated.

LLMOps isn’t a project with an end date — it’s an ongoing operational practice, like security or reliability engineering.