Step 1 — Foundation Models & Prompting

A quick word before we get into the material. There is no official “AWS Certified Generative AI Developer” exam as of this writing — you won’t find a code for it on AWS’s certification page, and nothing here should be mistaken for one. What you’re reading is a self-directed skills track, built around the direction AWS’s real credentials (AI Practitioner, ML Engineer Associate) and its Bedrock and AI-service documentation are already pointing: toward builders who can take a foundation model and turn it into a working, production-grade application. Treat this as a map of the territory, not a syllabus for a proctor. If and when AWS ships a real credential in this space, the skills below will still transfer directly.

With that out of the way — let’s talk about the actual craft of working with foundation models.

Picking a Model Is an Engineering Decision, Not a Preference

New Bedrock users tend to pick whichever model they’ve heard of and move on. That’s a mistake once you’re building something that has to run in production, get evaluated, and stay within a budget. Every model call you make is a tradeoff along four axes: quality, latency, cost per token, and context length. Bedrock gives you access to models from Anthropic’s Claude family, Amazon’s own Nova family, and a rotating set of third-party providers, and the right choice depends entirely on the job.

A classification task — routing a support ticket into one of eight categories — doesn’t need your most capable (and most expensive) model. A smaller, faster model handles it fine, and you’ll pay a fraction of the cost per call. A multi-step reasoning task — drafting a legal summary that has to track several interacting clauses — benefits from a larger, more capable model even though it costs more and responds slower. The failure mode I see most often in early-career builders is defaulting to the biggest model for everything, then being surprised when the AWS bill arrives.

TASK COMPLEXITY                MODEL TIER TO REACH FOR
──────────────────             ─────────────────────────────
Classification, extraction     Small/fast tier — cheap, low latency
Summarization, rewriting       Mid tier — balanced cost and quality
Multi-step reasoning, coding   Large/frontier tier — higher cost, higher ceiling
Multimodal (image + text)      Model family with vision support

A pattern worth adopting early: build with the cheap, fast model first. Get your prompt and your evaluation harness working end to end. Only reach for a bigger model once you’ve confirmed the smaller one genuinely can’t hit your quality bar. Flipping that order — starting big, then trying to downgrade later — almost never happens in practice, because nobody wants to re-test a system that’s already “working.”

Tokens, Context Windows, and Why They’re Not Just Trivia

Every model you call through Bedrock processes text as tokens, not characters or words. A token is roughly three-quarters of an English word on average, though it varies — common words are often a single token, rare words and most non-English text fragment into more. This matters practically because you’re billed per token, both on the input side (your prompt, any retrieved documents, conversation history) and the output side (what the model generates back).

The context window is the hard ceiling on how many tokens fit into a single call, input and output combined. By 2026, context windows on frontier-tier models are large enough to hold long documents or entire codebases, which really does change how you architect things — you’re no longer forced to shred every document into tiny fragments just to make it fit. But a bigger window isn’t a free lunch. Larger prompts cost more per call, take longer to process, and models still tend to pay less attention to information buried in the middle of a very long context than to what’s near the beginning or end. That’s often called positional bias, and it’s a real, measurable effect — not a theoretical concern.

A CALL'S TOKEN BUDGET
┌───────────────────────────────────────────────────────┐
│ System prompt │ Retrieved context │ History │ Response │
└───────────────────────────────────────────────────────┘
        billed as input tokens              billed as output

Practical implication: if your app is chatty (long conversation history) or RAG-heavy (large retrieved chunks), your input tokens will dominate your bill, not your output. Trim history aggressively, summarize old turns instead of replaying them verbatim, and only retrieve what you need.

Sampling Controls: Temperature and Top-p

When a model generates text, at each step it’s choosing the next token from a probability distribution over its vocabulary. Two knobs control how “adventurous” that choice is.

Temperature scales the randomness of that choice. At temperature 0, the model always picks the single most probable next token — deterministic, or close to it, and repeatable. As temperature rises, lower-probability tokens get a real chance of being picked, which produces more varied, sometimes more creative, sometimes less reliable output.

Top-p (nucleus sampling) works differently — instead of scaling all probabilities, it restricts the choice to the smallest set of tokens whose combined probability reaches a threshold p, then samples from just that set. A top-p of 0.9 means “only consider tokens that together make up 90% of the probability mass,” which cuts off the long tail of unlikely, potentially nonsensical tokens no matter how high the temperature is set.

Setting	Low value	High value	Use When
Temperature	Near-deterministic, repetitive	Creative, varied, riskier	Low for extraction/classification; higher for brainstorming/creative copy
Top-p	Narrow, “safe” vocabulary	Wide vocabulary, more surprise	Often left near 0.9–1.0, tuned alongside temperature rather than instead of it

A rule that’s served me well: for anything that gets parsed downstream — JSON output, a category label, a extracted field — push temperature toward zero. You want the same input to produce the same output every time you run your evaluation suite. For anything meant to feel human-authored — a marketing draft, a brainstorm — bump it up and accept some variance.

Prompt Engineering Techniques Worth Actually Practicing

Zero-shot prompting — just asking directly — is the starting point, and it’s often enough for simple, well-defined tasks. But three techniques go further and are worth deliberate practice, not just passive awareness.

Few-shot prompting shows the model two or three examples of the input-output pattern you want before asking it to do the real one. This is disproportionately effective for tasks with a specific output format — a particular JSON schema, a house style, a scoring rubric — because you’re demonstrating the pattern rather than describing it in prose, and models are very good at pattern-matching from examples.

Chain-of-thought prompting asks the model to work through intermediate reasoning steps before producing a final answer, rather than jumping straight to a conclusion. This measurably improves accuracy on multi-step problems — arithmetic, multi-clause logic, anything where skipping a step compounds into a wrong answer. The tradeoff is straightforward: more output tokens, more latency, more cost per call. Don’t reach for it on tasks that don’t actually need multi-step reasoning; you’ll just be paying for a longer answer to a simple question.

System prompts set persistent instructions that frame every turn of a conversation — the model’s role, its constraints, its tone, what it should refuse to do. This is where you put the things that shouldn’t change turn to turn: “You are a support assistant for Acme Corp. Only answer questions about Acme products. If asked about anything else, politely decline.” Keep system prompts stable and version them like code, because a small wording change here can shift behavior across your entire user base at once.

PROMPT ANATOMY FOR A PRODUCTION CALL
┌─────────────────────────────────────────┐
│ System prompt (role, rules, tone)        │  ← stable, versioned
├─────────────────────────────────────────┤
│ Few-shot examples (optional)             │  ← shows the pattern
├─────────────────────────────────────────┤
│ Retrieved context (optional, RAG)        │  ← grounds the answer
├─────────────────────────────────────────┤
│ User's actual question                   │  ← the variable part
└─────────────────────────────────────────┘

Order matters more than people expect. Instructions placed right before the user’s question tend to get followed more reliably than the same instructions buried at the very top of a long system prompt — another symptom of the positional bias mentioned earlier.

Evaluating Prompts Before You Ship Them

The single biggest mistake I see in early GenAI projects is shipping a prompt because it “looked good” on three manual test cases. That’s not evaluation, that’s vibes. A minimal but real evaluation setup needs three things: a held-out set of representative test inputs (ideally 20-plus, covering edge cases and adversarial phrasing, not just the happy path), a scoring method (exact match for structured output, a rubric scored by a separate judge model for open-ended text, or human review for anything high-stakes), and a way to re-run that scoring every time you touch the prompt.

Treat prompt changes the same way you’d treat a code change: run the eval set, compare the score to your previous baseline, and only ship if it’s neutral or better. Skipping this step is how teams end up in a loop of “fixing” one failure case while silently breaking three others they never re-tested.

Key Skills This Step Builds

Matching task complexity to the right model tier on Bedrock instead of defaulting to the largest available model
Reasoning about token costs and context window limits as real architectural constraints, not just billing footnotes
Choosing temperature and top-p deliberately based on whether output needs to be deterministic or varied
Writing few-shot examples that demonstrate format and style rather than describing them in prose
Applying chain-of-thought prompting selectively, only where multi-step reasoning actually improves accuracy
Structuring system prompts as versioned, stable instructions separate from per-request user input
Building a lightweight prompt evaluation harness with a fixed test set and repeatable scoring before shipping changes

Written by NPBlue Cloud Team — Cloud & Platform Engineers who runs production workloads on AWS daily and writes from real deployment experience, not the docs alone.

Reviewed for technical accuracy. Spot an error? Let us know.