Step 2 — Generative AI & Foundation Models
Ask ten people what a “foundation model” actually is and you’ll get ten vague answers involving the word “big.” The exam wants precision, not vibes. This step pulls apart the machinery — what these models are built from, how you steer them, and where they quietly fail you if you’re not careful.
What Makes a Model a “Foundation Model”
A foundation model is a large model trained on a broad, diverse swath of data — text, code, images, or a mix — such that it develops general-purpose capabilities rather than being built for one narrow task. The defining trait isn’t size alone; it’s that one trained model can be adapted to many downstream jobs: summarizing a document today, drafting an email tomorrow, classifying support tickets the day after, all without retraining from scratch.
Compare that to the older, narrow-ML approach:
TRADITIONAL ML                       FOUNDATION MODEL APPROACH
────────────────────────             ────────────────────────────
Train model A → spam filter          Train ONE large model on
Train model B → sentiment score      massive general data
Train model C → topic classifier                  │
Train model D → translation              ┌────────┴────────┐
                                         ▼        ▼        ▼
Each model: separate                  Prompt  Fine-tune  RAG-augment
training run, separate               for spam for legal  for company
dataset, narrow skill                detection docs Q&A  knowledge base

That single foundation model can be steered toward wildly different jobs depending on how you adapt it, which brings us to the three adaptation strategies the exam cares about.
Pretraining, Fine-Tuning, and Prompting — Three Different Levers
Pretraining happens once, by the model provider, on enormous datasets, at enormous cost. This is where the model learns grammar, facts, reasoning patterns, and coding syntax. As someone building on AWS, you almost never do this yourself — you consume an already-pretrained model through Bedrock or SageMaker JumpStart.
Fine-tuning takes a pretrained model and continues training it on a smaller, task-specific or domain-specific dataset, adjusting the model’s weights so it performs better on that narrower job — say, a customer service tone specific to your company, or terminology specific to your industry. It costs more than prompting and takes more time, but it can bake behavior in more durably.
Prompting is the cheapest and fastest lever: you don’t change the model’s weights at all, you just craft the input text carefully to get the output you want. Techniques like zero-shot (just ask), few-shot (show examples in the prompt), and chain-of-thought (ask the model to reason step by step) all fall under this umbrella.
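The three prompting techniques differ only in how the input text is assembled. Here is a minimal sketch of that idea; the function names, the `Input:`/`Output:` layout, and the sample reviews are illustrative assumptions, not a format any particular model requires.

```python
def zero_shot(task: str, text: str) -> str:
    """Zero-shot: state the task and supply the input, with no examples."""
    return f"{task}\n\nInput: {text}\nOutput:"


def few_shot(task: str, examples: list[tuple[str, str]], text: str) -> str:
    """Few-shot: show worked input/output pairs before the real input."""
    shots = "\n\n".join(f"Input: {inp}\nOutput: {out}" for inp, out in examples)
    return f"{task}\n\n{shots}\n\nInput: {text}\nOutput:"


def chain_of_thought(task: str, text: str) -> str:
    """Chain-of-thought: ask for step-by-step reasoning before the answer."""
    return (
        f"{task}\n\nInput: {text}\n"
        "Think through the problem step by step, then give the final answer."
    )


# Hypothetical task and examples, made up for illustration.
TASK = "Classify the sentiment of the review as positive or negative."
EXAMPLES = [("Loved it.", "positive"), ("Broke in a day.", "negative")]

prompt = few_shot(TASK, EXAMPLES, "Works as advertised.")
print(prompt)
```

In all three cases the model itself is untouched. Only the string sent to it changes, which is why this lever is the fastest to iterate on.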
| Approach | Changes Model Weights? | Cost | Speed to Implement | Best For |
|---|---|---|---|---|
| Pretraining | Yes — from scratch | Very high | Months | Building a brand-new foundation model (rare for most orgs) |
| Fine-tuning | Yes — incremental | Moderate to high | Days to weeks | Domain-specific tone, terminology, or task specialization |
| Prompt engineering | No | Low | Minutes | Quick iteration, general-purpose tasks |
| RAG (retrieval-augmented generation) | No | Low to moderate | Hours to days | Grounding answers in current, private, or factual data |
Notice RAG sits in that table too — it’s not fine-tuning, and the exam loves to test that distinction. RAG doesn’t change the model at all; it changes what information the model sees at the moment of answering, by retrieving relevant documents and stuffing them into the prompt.
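To make the "no weight changes" point concrete, here is a rough sketch of the prompt-assembly half of RAG. It assumes the retrieval step has already returned some text chunks; `build_rag_prompt`, the instruction wording, and the sample chunks are all hypothetical.

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Place retrieved text in the prompt; the model's weights are untouched."""
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up chunks standing in for whatever a retriever would return.
chunks = [
    "Password resets are done from the Account page.",
    "Reset links expire after 24 hours.",
]
rag_prompt = build_rag_prompt("How long is a reset link valid?", chunks)
print(rag_prompt)
```

Swap the chunks and the same model gives answers grounded in different data, with no training run involved.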
Tokens and Context Windows, Without the Math
A model doesn’t read text the way you do. It breaks input into tokens — chunks that might be a whole word, part of a word, or punctuation. As a rough rule of thumb, a token is a bit less than one English word on average. “Understanding” might split into two tokens; “cat” is probably one.
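For back-of-the-envelope planning, the rule of thumb can be turned into a rough estimator. The sketch below assumes about 0.75 English words per token, a commonly quoted approximation rather than a property of any specific tokenizer. Real counts depend on the model's tokenizer and on the language of the text.

```python
import math

# Assumed rule of thumb: one token is roughly three quarters of an English word.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


print(estimate_tokens("How do I reset my password?"))  # 6 words -> 8 estimated tokens
```

Treat the result as a sizing hint only. For billing or hard limits, use the token counts the model provider reports.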
The context window is the maximum number of tokens a model can consider at once, counting both what you feed in (the prompt, any retrieved documents, conversation history) and what it generates back. Run out of room and something has to give: depending on the service, the request is rejected, or the application has to truncate or drop older content.
CONTEXT WINDOW (fixed capacity, measured in tokens)
┌────────────────────────────────────────────────────────┐
│ System  │ Retrieved    │ Conversation  │ Model         │
│ Prompt  │ Documents    │ History       │ Output        │
│ (rules) │ (RAG chunks) │ (prior turns) │ (reply)       │
└────────────────────────────────────────────────────────┘
▲                                                        ▲
└──── everything above must fit inside one window ───────┘

By 2026, context windows on frontier models have grown large enough to hold entire books or codebases in a single pass, which changes the design conversation: instead of always chunking documents into tiny fragments, teams can sometimes feed much larger source material directly. But bigger context windows cost more per request and can still suffer from the model paying less attention to content buried in the middle of a very long prompt, so retrieval and summarization remain relevant skills, not obsolete ones.
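One common way an application handles the fixed budget is to keep the newest conversation turns and drop the oldest. The sketch below illustrates that bookkeeping under loud assumptions: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words, and the window sizes are tiny made-up numbers chosen so the arithmetic is easy to follow.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_history(system_prompt, history, window, reserve_for_output):
    """Keep the newest turns that fit after the system prompt and output reserve."""
    budget = window - reserve_for_output - count_tokens(system_prompt)
    kept = []
    for turn in reversed(history):  # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > budget:
            break  # this turn and everything older is dropped
        kept.append(turn)
        budget -= cost
    return list(reversed(kept))  # restore chronological order


history = [
    "user: hi there",
    "assistant: hello, how can I help",
    "user: reset my password please",
]
# Window of 20, minus 5 reserved for the reply, minus 5 for the system prompt = 10 left.
print(fit_history("You are a support bot.", history, window=20, reserve_for_output=5))
```

With these numbers only the latest user turn survives. Summarizing older turns instead of dropping them is another option, at the cost of an extra model call.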
Embeddings and Vector Search, Conceptually
An embedding is a numeric representation of text (or an image, or audio) — a list of numbers, a vector, positioned in a high-dimensional space such that semantically similar items end up near each other. “Puppy” and “dog” land close together; “puppy” and “spreadsheet” land far apart.
This is the trick behind semantic search: instead of matching exact keywords, you convert a search query into an embedding and find the stored documents whose embeddings are nearest to it — even if they don’t share a single word in common.
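The nearest-neighbor lookup can be sketched in a few lines. The example below uses cosine distance, one common choice of metric, over toy three-dimensional vectors invented for illustration. Real embeddings come from an embedding model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity: near 0 for similar directions, larger for dissimilar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query_vec, index, k=2):
    """Return the k document ids whose vectors are closest to the query."""
    ranked = sorted(index, key=lambda doc_id: cosine_distance(query_vec, index[doc_id]))
    return ranked[:k]


# Toy vectors, made up for this sketch (not the output of a real embedding model).
index = {
    "Account recovery steps": [0.10, -0.90, 0.40],
    "Login troubleshooting": [0.30, -0.70, 0.60],
    "Shipping policy FAQ": [-0.80, 0.50, 0.10],
}
query = [0.12, -0.87, 0.44]  # pretend embedding of "How do I reset my password?"

print(nearest(query, index))
```

This brute-force scan compares the query against every stored vector. A vector index exists to avoid that by using approximate nearest-neighbor structures, so lookups stay fast at millions of items.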
"How do I reset my password?" │ [ Embed the query ] │ ▼ Vector: [0.12, -0.87, 0.44, ...] │ ┌──────────────┼──────────────┐ ▼ ▼ ▼ Doc: "Account Doc: "Login Doc: "Shipping recovery steps" troubleshooting" policy FAQ" distance: 0.04 distance: 0.09 distance: 0.91 │ │ ┌───┴──────────────┘ ▼ Closest matches returned to the model as contextA vector database (or vector index) stores millions of these embeddings and can retrieve the nearest neighbors fast. On AWS, this capability shows up inside Bedrock Knowledge Bases, in OpenSearch Service’s vector engine, and in vector capabilities added to several managed database services — the exam wants you to recognize the pattern conceptually more than memorize every product name that supports it.
What Generative AI Is Actually Good At
- Summarization — condensing long documents, meeting transcripts, or support threads into digestible summaries
- Conversational assistants — chatbots and virtual agents that hold context across a conversation
- Code generation — drafting functions, explaining unfamiliar code, suggesting fixes
- Content creation — marketing copy, product descriptions, drafts of reports
- Data extraction and transformation — pulling structured fields out of unstructured text, rewriting content into a different format or tone
Where It Falls Down — Know These Cold
The exam will absolutely test your understanding of generative AI’s limitations, because deploying it responsibly means knowing where it can quietly mislead you.
Hallucination — The model generates plausible-sounding but factually incorrect content, stated with the same confidence as correct content. It isn’t “lying” — it’s predicting likely-sounding text, and likely-sounding isn’t the same as true. This is precisely why RAG and grounding in verified data matter so much for factual use cases.
Bias — Models learn from training data, and training data reflects the biases present in the world and in whoever curated it. A model can reproduce and even amplify stereotypes present in its training corpus unless deliberately mitigated.
Non-determinism — Ask the same model the same question twice and you may get two differently worded (sometimes substantively different) answers, especially with non-zero “temperature” settings that inject randomness into generation. This matters for testing, auditing, and any workflow that assumes reproducibility.
Lack of true understanding — These models predict statistically likely next tokens; they don’t “reason” the way a human does, even when their output reads like careful reasoning. That’s worth sitting with — output fluency is not proof of correctness.
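The temperature setting mentioned under non-determinism can be illustrated with a toy sampler. This is a simplified sketch: the candidate tokens and scores are made up, and real inference stacks usually combine temperature with other controls such as top-p or top-k.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng):
    """Pick the next token: greedy at temperature 0, random draw otherwise."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Made-up candidate tokens and scores for illustration.
tokens = ["Paris", "London", "Rome"]
logits = [2.0, 1.0, 0.5]
rng = random.Random()

print([sample_token(tokens, logits, 0, rng) for _ in range(5)])    # same token every time
print([sample_token(tokens, logits, 1.5, rng) for _ in range(5)])  # can differ run to run
```

Lowering the temperature makes output more repeatable, which helps with testing and auditing. Even so, treat reproducibility from a hosted model as something to verify, not assume.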
Exam Focus: What Questions Test From This Step
- Defining a foundation model correctly: broad training, general-purpose adaptability, not “just a big model”
- Distinguishing pretraining vs. fine-tuning vs. prompting vs. RAG — especially that RAG does not alter model weights
- Understanding tokens as the unit models process, and context window as the hard capacity limit on a request
- Recognizing embeddings as numeric vectors capturing semantic meaning, and vector search as nearest-neighbor retrieval
- Identifying hallucination, bias, and non-determinism as inherent risks, not edge-case bugs
- Matching a described business need (chatbot, summarizer, code assistant) to the generative AI use case category it represents