LLMOps
Moving an LLM application from prototype to production is harder than it looks. The model might work great in demos but drift in quality over time. Costs balloon unexpectedly. Latency spikes under load. Users are unhappy but you don’t know why.
LLMOps is the discipline of operating LLM-based systems reliably, efficiently, and at scale — borrowing from MLOps and DevOps but adapted for the unique challenges of generative AI.
How LLMOps Differs from Traditional MLOps
Traditional ML models:
- Deterministic outputs (given same input, same output)
- Performance degrades predictably with data drift
- Well-established metrics (accuracy, F1, AUC)
- Models are static artifacts between retraining
LLM applications:
- Non-deterministic outputs by design
- “Correctness” is subjective and context-dependent
- Third-party models change underneath you (OpenAI, Anthropic update models)
- Prompt changes are deployable changes — no retraining required
- Costs are variable based on token count, not just request count
The LLMOps Lifecycle
```
┌───────────────────────────────────────────────────────────────┐
│                       LLMOps Lifecycle                        │
├──────────────┬──────────────┬──────────────┬──────────────────┤
│ DEVELOP      │ EVALUATE     │ DEPLOY       │ MONITOR          │
│              │              │              │                  │
│ Prompt dev   │ Eval dataset │ Version mgmt │ Cost tracking    │
│ RAG tuning   │ LLM judging  │ Canary/A/B   │ Latency tracking │
│ Tool design  │ Human review │ Rollback     │ Quality metrics  │
│ Fine-tuning  │ Regression   │ Cache layer  │ Error rates      │
└──────────────┴──────────────┴──────────────┴──────────────────┘
```
Prompt Versioning and Management
Prompts are code. Treat them that way.
```yaml
version: "2.3.1"
model: "claude-3-5-sonnet-20241022"
created: "2025-11-15"
author: "team-ai"
changelog: "Added explicit refund policy handling, improved escalation clarity"

system: |
  You are a customer support assistant for Acme Corp...

parameters:
  temperature: 0.3
  max_tokens: 512

eval_metrics:
  last_run: "2025-11-14"
  faithfulness: 0.92
  user_satisfaction: 4.3/5
```
Store prompt versions in git. Roll back on quality regression. Tag versions that are deployed to production.
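The "tag what is in production, roll back on regression" workflow can be sketched as a small registry. This is only an illustration under assumptions: `PromptVersion`, `PromptRegistry`, `promote`, and `rollback` are hypothetical names, and the registry is in memory, whereas in practice the history would live in git tags or a database.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptVersion:
    version: str  # semantic version string, e.g. "2.3.1"
    system: str   # system prompt text
    model: str    # pinned model identifier


@dataclass
class PromptRegistry:
    """Hypothetical in-memory registry that tracks which version is live."""
    versions: dict = field(default_factory=dict)  # version string -> PromptVersion
    history: list = field(default_factory=list)   # production deploys, oldest first

    def register(self, prompt: PromptVersion) -> None:
        self.versions[prompt.version] = prompt

    def promote(self, version: str) -> None:
        """Mark a registered version as the one serving production."""
        if version not in self.versions:
            raise KeyError(f"unknown prompt version: {version}")
        self.history.append(version)

    def production(self) -> PromptVersion:
        if not self.history:
            raise RuntimeError("no version has been promoted")
        return self.versions[self.history[-1]]

    def rollback(self) -> PromptVersion:
        """Drop the current production version and return to the previous one."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self.history.pop()
        return self.production()


registry = PromptRegistry()
registry.register(PromptVersion("2.3.0", "You are a support assistant...",
                                "claude-3-5-sonnet-20241022"))
registry.register(PromptVersion("2.3.1", "You are a customer support assistant for Acme Corp...",
                                "claude-3-5-sonnet-20241022"))
registry.promote("2.3.0")
registry.promote("2.3.1")
live_before = registry.production().version  # "2.3.1"
live_after = registry.rollback().version     # "2.3.0" after a quality regression
```

Keeping the deploy history separate from the set of known versions means a rollback never deletes a prompt; it only changes which one is served.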
Observability: What to Instrument
Every LLM call should produce a trace. At minimum, capture:
```python
import hashlib
import uuid
from datetime import datetime, timezone

# Structured LLM call logging.
# Assumes `response`, `system_prompt`, `user_message`, `start_time`, `end_time`
# (in seconds) and `calculate_cost` are provided by the surrounding call site.
trace = {
    "trace_id": str(uuid.uuid4()),
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "model": "claude-3-5-sonnet-20241022",
    "prompt_version": "2.3.1",
    "environment": "production",

    # Input
    "system_prompt_hash": hashlib.sha256(system_prompt.encode()).hexdigest()[:8],
    "input_tokens": response.usage.input_tokens,
    "user_message_length": len(user_message),

    # Output
    "output_tokens": response.usage.output_tokens,
    "latency_ms": (end_time - start_time) * 1000,
    "finish_reason": response.stop_reason,

    # Cost
    "cost_usd": calculate_cost(response.usage.input_tokens,
                               response.usage.output_tokens,
                               model="claude-3-5-sonnet"),

    # Quality signals
    "user_feedback": None,         # populated later from thumbs up/down
    "automated_eval_score": None,  # populated by async eval job
}
```
Cost Monitoring and Optimization
LLM costs can surprise you. A system that works fine at 100 users/day can break the budget at 10,000 users/day.
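A back-of-the-envelope projection makes the scaling concrete. The sketch below is illustrative only: the price table reuses the per-million-token figures quoted in this article and may be out of date, `calculate_cost` is a hypothetical helper matching the name used in the trace example, and the workload (5 requests per user, 2,000 input and 500 output tokens each) is an assumption.

```python
# Assumed prices in USD per 1M tokens (input, output); check current provider pricing.
PRICES_PER_MILLION = {
    "claude-3-haiku": (0.25, 1.25),
    "claude-3-5-sonnet": (3.00, 15.00),
}


def calculate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Return the cost of one call in USD under the assumed price table."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def daily_cost(users: int, requests_per_user: int, input_tokens: int,
               output_tokens: int, model: str) -> float:
    """Project daily spend, assuming every request has the same token counts."""
    per_request = calculate_cost(input_tokens, output_tokens, model)
    return users * requests_per_user * per_request


# Illustrative workload: 5 requests per user, 2,000 input and 500 output tokens each.
small = daily_cost(100, 5, 2_000, 500, "claude-3-5-sonnet")
large = daily_cost(10_000, 5, 2_000, 500, "claude-3-5-sonnet")
print(f"100 users/day:    ${small:,.2f}")
print(f"10,000 users/day: ${large:,.2f}")
```

Under these assumptions each request costs $0.0135, so daily spend grows from about $6.75 to about $675, which is roughly $20,000 over a 30-day month.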
Track Costs Per Feature, Not Just Total
```python
# Tag every LLM call with its feature
trace["feature"] = "document_summarization"
# vs. "chat_response" vs. "code_review"

# Then aggregate:
daily_cost_by_feature = db.query("""
    SELECT feature,
           SUM(cost_usd) as total_cost,
           COUNT(*) as request_count,
           AVG(cost_usd) as avg_cost_per_request
    FROM llm_traces
    WHERE date = CURRENT_DATE
    GROUP BY feature
    ORDER BY total_cost DESC
""")
```
Optimization Techniques
Prompt compression: Tools like LLMLingua can compress prompts by 2–4× with minimal quality loss. A 2,000-token prompt becomes 600 tokens.
Caching:
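```python
import hashlib

import redis

cache = redis.Redis()


def cached_llm_call(prompt: str, ttl_seconds: int = 3600) -> str:
    # Key on a hash of the exact prompt text.
    cache_key = hashlib.sha256(prompt.encode()).hexdigest()
    cached = cache.get(cache_key)
    if cached is not None:
        return cached.decode()  # Redis returns bytes

    response = llm.generate(prompt)  # `llm` is your model client
    cache.setex(cache_key, ttl_seconds, response)
    return response
```
Exact-match caching like this is effective for identical prompts (FAQ answers, repeated system prompts with the same documents). Near-identical prompts hash to different keys, so catching those requires normalizing the prompt before hashing or using a semantic cache.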
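Because the cache key is a hash of the exact prompt text, a difference in case or whitespace is enough to cause a miss. One cheap mitigation is to normalize before hashing. The sketch below is a minimal illustration, not a drop-in replacement: it uses an in-memory dict in place of Redis so it runs standalone, and `normalize_prompt`, `cached_call`, and `fake_llm` are hypothetical names.

```python
import hashlib
import re
import time
from typing import Callable

# In-memory stand-in for Redis: key -> (expiry time, response).
_cache: dict = {}


def normalize_prompt(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivially different prompts share a key."""
    return re.sub(r"\s+", " ", prompt).strip().lower()


def cached_call(prompt: str, generate: Callable[[str], str],
                ttl_seconds: float = 3600) -> str:
    key = hashlib.sha256(normalize_prompt(prompt).encode()).hexdigest()
    entry = _cache.get(key)
    if entry is not None and entry[0] > time.monotonic():
        return entry[1]  # cache hit: no model call
    response = generate(prompt)
    _cache[key] = (time.monotonic() + ttl_seconds, response)
    return response


calls = []


def fake_llm(prompt: str) -> str:
    """Hypothetical model stub that records each real call."""
    calls.append(prompt)
    return f"answer #{len(calls)}"


first = cached_call("What is your refund policy?", fake_llm)
second = cached_call("  what is your   REFUND policy? ", fake_llm)  # served from cache
```

Lowercasing is only safe when case carries no meaning for the task; for code or case-sensitive identifiers, collapse whitespace only. Paraphrases still miss, which is where embedding-based semantic caches come in.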
Model routing: Use a cheap model for simple tasks and an expensive model for hard ones.
```python
def route_to_model(query: str, complexity_threshold: float = 0.7) -> str:
    complexity = estimate_complexity(query)  # your classifier

    if complexity < complexity_threshold:
        return "claude-3-haiku-20240307"  # $0.25/$1.25 per 1M tokens
    else:
        return "claude-3-5-sonnet-20241022"  # $3/$15 per 1M tokens
```
Output length control: Explicitly limit response length, for example with `max_tokens` and an instruction to answer concisely. At the prices above, an output token costs 5× as much as an input token, so every unnecessary output token costs money.
Deployment Strategies
Blue-Green Deployment
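Run two versions side by side and move traffic from the old one to the new one. Strict blue-green deployment switches all traffic in a single cutover; the schedule below is a gradual variant that shifts traffic in stages while you monitor metrics.
```
Week 1: 100% → v1.0 (old prompt)
Week 2: 90% → v1.0, 10% → v1.1 (new prompt, monitor)
Week 3: 50% → v1.0, 50% → v1.1 (if metrics good)
Week 4: 0% → v1.0, 100% → v1.1 (full rollout)
```
A/B Testing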
Different user segments see different model versions. Measure downstream outcomes (conversion rate, session length, support escalation rate).
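```python
import hashlib


def get_model_version(user_id: str) -> str:
    # Deterministic assignment based on user ID: the same user always
    # lands in the same bucket. MD5 is used for bucketing, not security.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    if bucket < 10:
        return "experiment"  # 10% bucket
    return "control"
```
Canary Releases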
Route a small percentage of traffic to the new version. Watch for error rates, latency, and quality degradation before full rollout.
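The "watch before full rollout" step can be turned into an explicit gate. The following is a sketch under assumptions: `VersionMetrics`, `canary_decision`, and every threshold are hypothetical and should be tuned to your own traffic and quality metrics.

```python
from dataclasses import dataclass


@dataclass
class VersionMetrics:
    requests: int
    errors: int
    p95_latency_ms: float
    avg_quality: float  # e.g. mean automated eval score in [0, 1]

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: VersionMetrics, canary: VersionMetrics,
                    min_requests: int = 500,
                    max_error_rate_increase: float = 0.01,
                    max_latency_ratio: float = 1.2,
                    max_quality_drop: float = 0.02) -> str:
    """Return "wait", "rollback", or "promote" for the canary version."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge yet
    if canary.error_rate - baseline.error_rate > max_error_rate_increase:
        return "rollback"
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"
    if baseline.avg_quality - canary.avg_quality > max_quality_drop:
        return "rollback"
    return "promote"


baseline = VersionMetrics(requests=9000, errors=45, p95_latency_ms=1800.0, avg_quality=0.92)
healthy = VersionMetrics(requests=1000, errors=6, p95_latency_ms=1900.0, avg_quality=0.93)
slow = VersionMetrics(requests=1000, errors=5, p95_latency_ms=2600.0, avg_quality=0.93)
early = VersionMetrics(requests=120, errors=0, p95_latency_ms=1700.0, avg_quality=0.95)

decisions = [canary_decision(baseline, m) for m in (healthy, slow, early)]
```

Requiring a minimum request count before deciding avoids promoting or rolling back on noise from the first few calls.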
LLMOps Tools (2025–2026)
| Category | Tools |
|---|---|
| Observability | Langfuse, Helicone, Braintrust, Arize Phoenix |
| Prompt management | Langfuse, Promptflow, PromptLayer |
| Evaluation | Braintrust, DeepEval, RAGAS, TruLens |
| Model serving | vLLM, Ollama, AWS Bedrock, Vertex AI |
| Caching | GPTCache (semantic cache from Zilliz), Redis |
| Cost tracking | Helicone, OpenMeter, custom via Langfuse |
| Gateways | LiteLLM (multi-provider proxy), Kong AI Gateway |
Production Readiness Checklist
Before going to production with an LLM application:
Observability
- Every LLM call is logged with trace ID, model, token counts, latency, cost
- Error rates dashboarded and alerted
- Cost per request tracked by feature
Quality
- Evaluation dataset exists (100+ examples minimum)
- Automated eval runs on every prompt change
- Quality metrics are baselined and regression alerting is set up
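A regression check on every prompt change can be as small as the sketch below. It is deliberately simplified: `run_eval`, `exact_match`, `is_regression`, and `stub_model` are hypothetical names, the four-example dataset stands in for the 100+ examples recommended above, and exact match stands in for whatever scorer (LLM judge, faithfulness metric) fits the task.

```python
from statistics import mean
from typing import Callable


def exact_match(output: str, expected: str) -> float:
    """Toy scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(examples: list, generate: Callable[[str], str],
             score: Callable[[str, str], float]) -> float:
    """Average score of the model's outputs over the evaluation dataset."""
    return mean(score(generate(ex["input"]), ex["expected"]) for ex in examples)


def is_regression(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Flag a regression when the score drops more than `tolerance` below baseline."""
    return baseline - current > tolerance


examples = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "opposite of hot", "expected": "cold"},
    {"input": "3*3", "expected": "9"},
]


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for the prompt under test; gets one answer wrong."""
    answers = {"2+2": "4", "capital of France": "paris",
               "opposite of hot": "warm", "3*3": "9"}
    return answers[prompt]


current_score = run_eval(examples, stub_model, exact_match)
blocked = is_regression(current_score, baseline=0.80)
```

Wired into CI, a `True` result from `is_regression` would fail the build for the prompt change instead of letting it reach production.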
Safety
- Input validation / injection detection
- Output content safety checks
- Rate limiting per user
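Per-user rate limiting is commonly implemented as a token bucket. The sketch below is one minimal, single-process version; `TokenBucketLimiter` is a hypothetical name, and a multi-instance deployment would need shared state (for example in Redis) rather than a local dict.

```python
import time


class TokenBucketLimiter:
    """Per-user token bucket: bursts up to `capacity`, refilled at `rate` requests/second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._buckets: dict = {}  # user_id -> (tokens remaining, time of last update)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(user_id, (float(self.capacity), now))
        # Refill for the time elapsed since the last request, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[user_id] = (tokens - 1, now)
            return True
        self._buckets[user_id] = (tokens, now)
        return False


# Demonstration with a fake clock: 1 request/second, burst of 3.
fake_now = [0.0]
limiter = TokenBucketLimiter(rate=1.0, capacity=3, clock=lambda: fake_now[0])
burst = [limiter.allow("alice") for _ in range(4)]  # fourth request is rejected
fake_now[0] = 2.0                                    # two seconds later
after_wait = limiter.allow("alice")                  # bucket has refilled
other_user = limiter.allow("bob")                    # separate bucket per user
```

For LLM workloads it can make sense to spend tokens from the bucket in proportion to the request's estimated token count rather than one per request, since cost scales with tokens.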
Reliability
- Retry logic with exponential backoff
- Timeout handling (LLM calls can hang)
- Fallback responses for model outages
- Circuit breaker for repeated failures
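The reliability items above fit together roughly as follows. This is a sketch, not a production implementation: `CircuitBreaker` and `call_with_resilience` are hypothetical names, the wrapped `call` is assumed to enforce its own timeout and raise on failure, and real code should catch only the provider's retryable errors instead of `Exception`.

```python
import random
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; allows a probe after `reset_seconds`."""

    def __init__(self, max_failures: int = 3, reset_seconds: float = 30.0):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.reset_seconds:
            # Cool-down elapsed: close the breaker and let the next call through.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()


def call_with_resilience(call: Callable[[], str], breaker: CircuitBreaker,
                         fallback: str, max_attempts: int = 3,
                         base_delay: float = 0.5) -> str:
    """Retry with exponential backoff and jitter; return `fallback` if all attempts fail."""
    if breaker.is_open():
        return fallback  # fail fast while the provider looks unhealthy
    for attempt in range(max_attempts):
        try:
            result = call()
            breaker.record(success=True)
            return result
        except Exception:  # assumption: narrow this to timeout / 5xx errors in real code
            breaker.record(success=False)
            if breaker.is_open() or attempt == max_attempts - 1:
                break
            # Backoff doubles each attempt, plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay * 0.1))
    return fallback


attempts = {"n": 0}


def flaky_call() -> str:
    """Simulated provider call that times out twice, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated provider timeout")
    return "ok"


breaker = CircuitBreaker(max_failures=5)
result = call_with_resilience(flaky_call, breaker,
                              fallback="Sorry, please try again later.",
                              base_delay=0.01)
```

The jitter keeps many clients from retrying in lockstep after an outage, and the breaker stops retries from piling load onto a provider that is already failing.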
Operations
- Prompts are versioned and deployable independently
- Model version is pinned (don’t auto-upgrade)
- On-call runbook for LLM failures
The Ongoing Operations Reality
Unlike traditional software, LLM applications require ongoing attention:
- Model updates: Your provider updates their model (GPT-4 → GPT-4-turbo → GPT-4o). Quality may change. Test before relying on the new version.
- Prompt drift: A prompt that works well today may degrade as the underlying model is updated. Monitor quality metrics continuously.
- Cost inflation: As your product grows, LLM costs grow proportionally. Build cost optimization into your roadmap, not as a crisis response.
- Adversarial users: As your product scales, users will probe for safety gaps. Your guardrails and monitoring need to catch novel attacks, not just the ones you anticipated.
LLMOps isn’t a project with an end date — it’s an ongoing operational practice, like security or reliability engineering.