Step 3 — Fine-Tuning & Customization
Here’s a question worth sitting with before you write any code: if prompting alone can’t get your application to behave the way you need, is the fix more instructions, more retrieved context, or an actually different model? Teams get this wrong constantly, usually by reaching for fine-tuning far too early, when the real problem was a prompt that never specified the output format clearly enough. This step is about making that decision correctly, and then executing on it if fine-tuning really is the answer.
Three Levers, One Decision Tree
You’ve got three ways to shape model behavior, and they’re not interchangeable: each one solves a different problem.
Prompting changes nothing about the model. You’re just steering it with better instructions, examples, or structure at request time. It’s the fastest to iterate on and the cheapest, and it should always be your first attempt, not your last resort.
RAG also changes nothing about the model — it changes what the model sees by retrieving relevant information at request time. It’s the right tool when the problem is a knowledge gap: the model doesn’t know your product catalog, your internal policies, or anything that happened after its training cutoff.
Fine-tuning actually adjusts the model’s weights through additional training on your own examples. It’s the right tool when the problem isn’t missing knowledge but a mismatch in behavior — the model knows the facts but won’t format answers the way your application needs, won’t adopt the tone your brand requires, or struggles with a narrow specialized task (classifying legal clauses into your firm’s exact taxonomy, say) that plain prompting can’t reliably pin down no matter how carefully you word it.
DECISION FLOW
──────────────────────────────────────────────────
Is the model missing facts it needs?
    YES → RAG (retrieve the missing information)
    NO  ↓
Can better instructions/examples fix the behavior?
    YES → Prompt engineering (cheapest, fastest)
    NO  ↓
Is the issue tone, format, or a narrow specialized
skill that prompting can't reliably enforce?
    YES → Fine-tuning

A mistake worth naming directly: fine-tuning does not fix a knowledge gap well. If you fine-tune a model on a snapshot of your product catalog, that catalog is frozen into the weights at training time, so next month’s price change or new SKU won’t show up until you retrain. RAG stays current because it retrieves live data at request time. People occasionally try to solve a freshness problem with fine-tuning and end up rebuilding a stale, expensive version of what RAG already does better.
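If it helps to see the decision flow above as something executable, here is a minimal sketch. The `Diagnosis` fields and the returned strings are hypothetical names chosen for illustration; the only thing taken from the flow itself is the order of the three questions.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    """Answers to the three questions in the decision flow (hypothetical field names)."""
    missing_facts: bool       # Does the model lack information it needs?
    fixable_by_prompt: bool   # Do better instructions or examples fix the behavior?
    behavior_mismatch: bool   # Is it tone, format, or a narrow specialized skill?


def choose_lever(d: Diagnosis) -> str:
    """Walk the flow top to bottom and return the first lever that applies."""
    if d.missing_facts:
        return "rag"
    if d.fixable_by_prompt:
        return "prompting"
    if d.behavior_mismatch:
        return "fine-tuning"
    # The flow gives no answer when all three questions are "no",
    # so the problem itself probably needs another look.
    return "re-diagnose"
```

Because the checks run in order, a stale-catalog problem lands on `"rag"` even when the team also dislikes the output format: `choose_lever(Diagnosis(True, False, True))` returns `"rag"`, never `"fine-tuning"`.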
Continued Pre-Training vs Instruction Fine-Tuning
Once you’ve decided fine-tuning is warranted, there are two conceptually different approaches, and mixing them up leads to wasted training runs.
Continued pre-training keeps the model’s original training objective (predicting the next token) but continues it on a large volume of your own unlabeled domain text — internal documentation, industry-specific writing, historical records. This teaches the model your domain’s vocabulary, style, and factual texture at a broad level, without teaching it a specific input-output task. Think of it as extending the model’s general education in your specific subject matter.
Instruction fine-tuning trains the model on paired examples of instruction and desired response — “given this input, produce this exact output.” This is what you use to teach a specific behavior: always respond in a particular JSON schema, always adopt a specific tone, always follow a specific classification taxonomy. It requires far less data than continued pre-training but is much more task-specific in what it teaches.
| Approach | Data Needed | Teaches | Typical Data Volume |
|---|---|---|---|
| Continued pre-training | Large unlabeled domain corpus | Vocabulary, style, domain fluency | Very large — comparable to a real training corpus |
| Instruction fine-tuning | Labeled input/output pairs | A specific task or behavior | Hundreds to low thousands of well-curated examples |
Most builder-level fine-tuning projects on Bedrock are instruction fine-tuning, not continued pre-training — the latter is a heavier, more specialized undertaking that’s closer to what a model provider does than what an application team typically needs.
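To make the difference in data shape concrete, here is a rough sketch of what the two kinds of training records might look like as JSON Lines. It assumes the `prompt`/`completion` and `input` field names that Bedrock documents for several text models; the required schema varies by base model, so treat the field names as an assumption to check against the documentation for the model you pick. The clause text and labels are made up for illustration.

```python
import json

# Instruction fine-tuning: labeled input/output pairs (illustrative examples).
instruction_examples = [
    {"prompt": "Classify the clause: 'Either party may terminate with 30 days notice.'",
     "completion": "TERMINATION"},
    {"prompt": "Classify the clause: 'Each party shall keep the terms confidential.'",
     "completion": "CONFIDENTIALITY"},
]

# Continued pre-training: unlabeled domain text, no target output (illustrative example).
pretraining_examples = [
    {"input": "A limitation-of-liability clause caps the damages one party can recover."},
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"
```

Calling `to_jsonl(instruction_examples)` yields one line per example, which is the form you would upload as a training file.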
Preparing Data for Fine-Tuning
Fine-tuning quality is almost entirely a data quality problem, not a technique problem. A few hundred carefully curated, correct, diverse examples usually outperform several thousand scraped, inconsistent ones. Before you start a training job, your dataset needs:
- Consistency — every example should follow the same format and the same implicit rules you want the model to learn. If half your examples format dates one way and half another, the model learns the inconsistency, not a clean rule.
- Coverage of edge cases — don’t just include the easy, obvious examples. Include the ambiguous, borderline cases that actually trip up prompting alone, since those are the cases fine-tuning is meant to fix.
- Held-out validation data — a separate slice of examples the model never trains on, used purely to check whether it’s actually learning to generalize or just memorizing the training set.
- Balanced representation — if you’re teaching a classifier-like behavior, make sure minority categories aren’t drowned out by a handful of dominant ones, or the fine-tuned model will quietly default toward the majority class.
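The checklist above translates fairly directly into code. What follows is a minimal sketch for a classifier-style dataset, assuming each example is a dict with hypothetical `prompt` and `completion` keys where `completion` holds the label. It covers de-duplication, a format check, and a per-label split so minority categories still appear in the held-out set; human review stays a manual step.

```python
import random
from collections import Counter, defaultdict


def deduplicate(examples):
    """Drop exact duplicates (same prompt and completion), keeping the first occurrence."""
    seen, unique = set(), []
    for ex in examples:
        key = (str(ex.get("prompt", "")).strip(), str(ex.get("completion", "")).strip())
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


def format_errors(examples, allowed_labels):
    """Return (index, reason) pairs for examples that break the expected format."""
    errors = []
    for i, ex in enumerate(examples):
        if not str(ex.get("prompt", "")).strip():
            errors.append((i, "empty prompt"))
        if ex.get("completion") not in allowed_labels:
            errors.append((i, f"unexpected label: {ex.get('completion')!r}"))
    return errors


def label_shares(examples):
    """Fraction of examples per label, to spot a dominant majority class."""
    counts = Counter(ex["completion"] for ex in examples)
    return {label: n / len(examples) for label, n in counts.items()}


def stratified_split(examples, validation_fraction=0.2, seed=0):
    """Split per label so every label with 2+ examples appears in validation."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["completion"]].append(ex)
    train, validation = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Hold out at least one example per label, unless the label has only one.
        n_val = max(1, round(len(group) * validation_fraction)) if len(group) > 1 else 0
        validation.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, validation
```

Splitting per label is a design choice, not a requirement: a plain random split is simpler, but with a rare category it can leave the validation set with no examples of that category at all, which hides exactly the majority-class drift the last bullet warns about.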
FINE-TUNING DATA PIPELINE
Raw examples ──► De-duplicate ──► Format-check ──► Human review
                                                        │
                    ┌───────────────────────────────────┘
                    ▼
        Split: Train set │ Validation set (held out)
                   │                  │
                   ▼                  ▼
             Training job   Evaluate after training

When Bedrock Customization Isn’t Enough: SageMaker
Bedrock’s customization features cover a large share of practical fine-tuning needs — instruction fine-tuning and, for some model families, continued pre-training — without you managing any training infrastructure. But there are cases where you need more control than a managed customization API gives you: a fully custom architecture, training techniques Bedrock doesn’t expose, tighter control over hyperparameters and training infrastructure, or training a model from open weights with a highly specialized pipeline. That’s where SageMaker comes in — it gives you the underlying compute, training job orchestration, and experiment tracking to run custom training loops directly, at the cost of having to manage far more of the process yourself.
The practical rule: reach for Bedrock customization first, because it removes an enormous amount of infrastructure work. Drop down to SageMaker only when you’ve confirmed Bedrock’s managed customization genuinely can’t express what you need — not because SageMaker feels more “serious” or hands-on.
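For a sense of how little infrastructure the Bedrock route involves, here is a sketch of submitting an instruction fine-tuning job with boto3's `create_model_customization_job`. Every concrete value you pass in (job name, role ARN, base model identifier, S3 URIs) is a placeholder you would supply, and the hyperparameter names are an assumption: they differ by base model, so check the ones your chosen model accepts. The request builder is kept separate from the network call so you can inspect the request without AWS credentials.

```python
def build_customization_request(job_name, model_name, role_arn, base_model_id,
                                train_s3_uri, validation_s3_uri, output_s3_uri,
                                hyperparameters):
    """Assemble keyword arguments for create_model_customization_job."""
    return {
        "jobName": job_name,
        "customModelName": model_name,
        "roleArn": role_arn,
        "baseModelIdentifier": base_model_id,
        "customizationType": "FINE_TUNING",  # "CONTINUED_PRE_TRAINING" is the other option
        "trainingDataConfig": {"s3Uri": train_s3_uri},
        "validationDataConfig": {"validators": [{"s3Uri": validation_s3_uri}]},
        "outputDataConfig": {"s3Uri": output_s3_uri},
        # Bedrock expects hyperparameter values as strings.
        "hyperParameters": {k: str(v) for k, v in hyperparameters.items()},
    }


def submit_customization_job(request, region_name):
    """Submit the job and return its ARN. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3 installed

    client = boto3.client("bedrock", region_name=region_name)
    response = client.create_model_customization_job(**request)
    return response["jobArn"]
```

Compare that to the SageMaker path, where you would also own the training script, the instance selection, and the experiment tracking. That gap is the infrastructure work the practical rule is telling you not to take on without a reason.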
Evaluating a Fine-Tuned Model Honestly
A fine-tuned model needs to be evaluated against two separate questions, and conflating them is a common mistake. First: did it actually improve on the target task, measured against your held-out validation set and ideally a separate test set it’s never seen? Second: did it get worse at anything it used to do fine — a phenomenon usually called catastrophic forgetting, where narrow fine-tuning on a specific task degrades general capability the base model had. Testing only the narrow task and declaring victory is how teams ship a fine-tuned model that aced its target benchmark but became noticeably worse at everyday requests outside that scope.
Compare your fine-tuned model’s outputs against both the base model’s outputs and, where feasible, against a well-crafted prompt-only baseline on the same task. If a carefully engineered prompt gets you 90% of the way there, the incremental gain from fine-tuning may not justify the added cost, the slower iteration that comes with retraining cycles, and the operational complexity of managing a custom model artifact.
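One way to keep both questions in view is to score every candidate on two evaluation sets at once: the target task and a general set of everyday requests. The sketch below assumes each model is wrapped as a plain callable from prompt to answer and that exact-match accuracy is a reasonable metric for your task. The function names, the `expected` key, and the 0.02 tolerance are all hypothetical choices.

```python
def accuracy(predict, examples):
    """Exact-match accuracy of a prompt -> answer callable on labeled examples."""
    if not examples:
        raise ValueError("evaluation set is empty")
    correct = sum(1 for ex in examples if predict(ex["prompt"]) == ex["expected"])
    return correct / len(examples)


def compare_models(models, target_set, general_set, baseline="base", tolerance=0.02):
    """Score each model on both sets and flag regressions against the baseline.

    models: dict mapping a name (e.g. "base", "prompt_only", "fine_tuned")
            to a callable that takes a prompt and returns an answer.
    """
    report = {
        name: {
            "target": accuracy(predict, target_set),
            "general": accuracy(predict, general_set),
        }
        for name, predict in models.items()
    }
    base = report[baseline]
    for scores in report.values():
        # Improvement on the task fine-tuning was meant to fix.
        scores["target_gain"] = scores["target"] - base["target"]
        # Possible catastrophic forgetting: general score dropped beyond the tolerance.
        scores["regressed"] = scores["general"] < base["general"] - tolerance
    return report
```

Reading the report row by row makes the tradeoff visible: a fine-tuned model with a large `target_gain` and `regressed` set to `True` has not clearly won, and a prompt-only row with nearly the same `target_gain` is a strong hint that fine-tuning may not be worth it.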
Cost and Latency Tradeoffs, Honestly Stated
Fine-tuning isn’t free, and it isn’t just the training job cost. A custom fine-tuned model typically means dedicated hosting rather than the pay-per-token, shared-capacity pricing of an off-the-shelf foundation model call — which changes your cost structure from “pay per request” to something closer to “pay for provisioned capacity,” even during low-traffic periods. That can be cheaper at high, steady volume and considerably more expensive at low or spiky volume.
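The "cheaper at high volume, more expensive at low volume" claim is easy to check with back-of-the-envelope arithmetic. The sketch below compares a flat hourly charge against per-token pricing and finds the monthly request count where the two meet. Every price in the usage note is a made-up placeholder, and the model deliberately ignores details such as commitment terms or different input and output token rates.

```python
def monthly_on_demand_cost(requests_per_month, tokens_per_request, price_per_1k_tokens):
    """Pay-per-token cost: scales linearly with traffic."""
    return requests_per_month * tokens_per_request / 1000 * price_per_1k_tokens


def monthly_provisioned_cost(hourly_rate, hours_per_month=730):
    """Dedicated capacity cost: flat, regardless of traffic."""
    return hourly_rate * hours_per_month


def breakeven_requests(hourly_rate, tokens_per_request, price_per_1k_tokens,
                       hours_per_month=730):
    """Monthly request volume at which both pricing models cost the same."""
    cost_per_request = tokens_per_request / 1000 * price_per_1k_tokens
    return monthly_provisioned_cost(hourly_rate, hours_per_month) / cost_per_request
```

With placeholder numbers of 10.00 per hour for dedicated capacity, 1,000 tokens per request, and 0.01 per 1,000 tokens on demand, the flat charge is 7,300 per month and the breakeven sits at roughly 730,000 requests per month. Below that volume the shared, pay-per-token option is cheaper under these assumptions; above it, the dedicated capacity starts to pay for itself.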
Latency-wise, a well-tuned smaller fine-tuned model can actually respond faster than a larger general-purpose model, because it doesn’t need a long few-shot prompt or large retrieved context to get the behavior right — the behavior is baked into the weights. That’s a genuine advantage in latency-sensitive applications, but it only pays off if the fine-tuning was worth doing in the first place.
Key Skills This Step Builds
- Applying the prompting-versus-RAG-versus-fine-tuning decision tree instead of defaulting to fine-tuning as a first move
- Distinguishing continued pre-training from instruction fine-tuning and matching each to the right kind of data
- Building a fine-tuning dataset with consistency, edge-case coverage, and a genuinely held-out validation split
- Recognizing when Bedrock’s managed customization is sufficient versus when SageMaker’s lower-level control is actually needed
- Evaluating a fine-tuned model for both task improvement and regression (catastrophic forgetting) rather than one-sided success metrics
- Reasoning about the cost and latency tradeoffs of dedicated fine-tuned model hosting versus shared, pay-per-token inference