Step 2 — Generative AI & Foundation Models

Ask ten people what a “foundation model” actually is and you’ll get ten vague answers involving the word “big.” The exam wants precision, not vibes. This step pulls apart the machinery — what these models are built from, how you steer them, and where they quietly fail you if you’re not careful.

What Makes a Model a “Foundation Model”

A foundation model is a large model trained on a broad, diverse swath of data — text, code, images, or a mix — such that it develops general-purpose capabilities rather than being built for one narrow task. The defining trait isn’t size alone; it’s that one trained model can be adapted to many downstream jobs: summarizing a document today, drafting an email tomorrow, classifying support tickets the day after, all without retraining from scratch.

Compare that to the older, narrow-ML approach:

TRADITIONAL ML                         FOUNDATION MODEL APPROACH
────────────────────────               ────────────────────────────
Train model A → spam filter            Train ONE large model on
Train model B → sentiment score        massive general data
Train model C → topic classifier                │
Train model D → translation             ┌────────┴────────┐
                                         ▼         ▼         ▼
Each model: separate                  Prompt    Fine-tune   RAG-augment
training run, separate                for spam  for legal   for company
dataset, narrow skill                 detection  docs Q&A   knowledge base

That single foundation model can be steered toward wildly different jobs depending on how you adapt it — which brings us to the three adaptation strategies the exam cares about.

Pretraining, Fine-Tuning, and Prompting — Three Different Levers

Pretraining happens once, by the model provider, on enormous datasets, at enormous cost. This is where the model learns grammar, facts, reasoning patterns, and coding syntax. As someone building on AWS, you almost never do this yourself — you consume an already-pretrained model through Bedrock or SageMaker JumpStart.

Fine-tuning takes a pretrained model and continues training it on a smaller, task-specific or domain-specific dataset, adjusting the model’s weights so it performs better on that narrower job — say, a customer service tone specific to your company, or terminology specific to your industry. It costs more than prompting and takes more time, but it can bake behavior in more durably.

Prompting is the cheapest and fastest lever: you don’t change the model’s weights at all, you just craft the input text carefully to get the output you want. Techniques like zero-shot (just ask), few-shot (show examples in the prompt), and chain-of-thought (ask the model to reason step by step) all fall under this umbrella.
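
The three techniques differ only in the text you send. Here is a minimal Python sketch, assuming a support-ticket classification task. The function names, labels, and example tickets are hypothetical, and no model is actually called; the code only builds the prompt strings.

```python
def zero_shot(ticket: str) -> str:
    """Zero-shot: state the task and ask, with no examples."""
    return (
        "Classify the support ticket as BILLING, LOGIN, or SHIPPING.\n"
        f"Ticket: {ticket}\n"
        "Label:"
    )


def few_shot(ticket: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: show labeled examples before the real input."""
    lines = ["Classify the support ticket as BILLING, LOGIN, or SHIPPING.", ""]
    for text, label in examples:
        lines += [f"Ticket: {text}", f"Label: {label}", ""]
    lines += [f"Ticket: {ticket}", "Label:"]
    return "\n".join(lines)


def chain_of_thought(ticket: str) -> str:
    """Chain-of-thought: ask for step-by-step reasoning before the answer."""
    return (
        "Classify the support ticket as BILLING, LOGIN, or SHIPPING.\n"
        f"Ticket: {ticket}\n"
        "Think through the problem step by step, then give the final label "
        "on its own line as 'Label: <LABEL>'."
    )


if __name__ == "__main__":
    examples = [
        ("I was charged twice this month.", "BILLING"),
        ("My package never arrived.", "SHIPPING"),
    ]
    print(few_shot("I can't sign in after changing my password.", examples))
```

In all three cases the model itself is identical; only the input changes, which is why prompting is the fastest lever to iterate on.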

Approach	Changes Model Weights?	Cost	Speed to Implement	Best For
Pretraining	Yes — from scratch	Very high	Months	Building a brand-new foundation model (rare for most orgs)
Fine-tuning	Yes — incremental	Moderate to high	Days to weeks	Domain-specific tone, terminology, or task specialization
Prompt engineering	No	Low	Minutes	Quick iteration, general-purpose tasks
RAG (retrieval-augmented generation)	No	Low to moderate	Hours to days	Grounding answers in current, private, or factual data

Notice RAG sits in that table too — it’s not fine-tuning, and the exam loves to test that distinction. RAG doesn’t change the model at all; it changes what information the model sees at the moment of answering, by retrieving relevant documents and stuffing them into the prompt.
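
The retrieve-then-prompt flow can be sketched in a few lines of Python. This is an illustration under loud assumptions: the retriever below ranks documents by naive word overlap as a stand-in for a real retrieval system, and every function name and document is hypothetical. The point it shows is that the only thing that changes is the prompt text.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for real retrieval scoring."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k documents sharing the most words with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:top_k]


def build_rag_prompt(question: str, documents: list[str], top_k: int = 2) -> str:
    """Place retrieved text in the prompt; the model's weights are untouched."""
    chunks = retrieve(question, documents, top_k)
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    docs = [
        "Refunds are issued within 14 days of a return request.",
        "Password resets require a verified email address.",
        "Standard shipping takes 3 to 5 business days.",
    ]
    print(build_rag_prompt("How long do refunds take?", docs, top_k=1))
```

Updating the knowledge the model can draw on means editing the document list, not retraining anything.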

Tokens and Context Windows, Without the Math

A model doesn’t read text the way you do. It breaks input into tokens: chunks that might be a whole word, part of a word, or punctuation. As a rough rule of thumb for English, one token is about three-quarters of a word, or roughly four characters. “Understanding” might split into two tokens; “cat” is probably one.

The context window is the maximum number of tokens a model can consider at once, counting both what you feed in (the prompt, any retrieved documents, conversation history) and what it generates back. Run out of room and something has to give: the application truncates or drops older content, or the request is rejected for exceeding the limit.

CONTEXT WINDOW (fixed capacity, measured in tokens)
┌────────────────────────────────────────────────────────┐
│  System    │   Retrieved    │   Conversation  │  Model  │
│  Prompt    │   Documents    │   History        │ Output  │
│  (rules)   │   (RAG chunks) │   (prior turns)  │ (reply) │
└────────────────────────────────────────────────────────┘
     ▲                                                 ▲
     everything above must fit inside one window ──────┘
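
One common way to stay inside the window is to reserve room for the reply and then drop the oldest conversation turns first. The sketch below assumes a four-characters-per-token estimate, which is a heuristic and not a real tokenizer; actual counts vary by model. The function names and the trimming policy are illustrative choices, not a prescribed method.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming about 4 characters per token (heuristic)."""
    return max(1, (len(text) + 3) // 4)


def fit_history(
    system_prompt: str,
    retrieved: list[str],
    history: list[str],
    window: int,
    reserve_for_output: int,
) -> list[str]:
    """Keep the newest turns that fit once fixed parts and output are budgeted."""
    fixed = estimate_tokens(system_prompt) + sum(estimate_tokens(c) for c in retrieved)
    budget = window - reserve_for_output - fixed
    if budget < 0:
        raise ValueError("system prompt and retrieved documents exceed the window")

    kept: list[str] = []
    used = 0
    for turn in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break  # this turn and everything older is dropped
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    turns = ["x" * 80, "y" * 80, "z" * 40]  # oldest first
    kept = fit_history("a" * 40, ["b" * 80], turns, window=100, reserve_for_output=30)
    print(len(kept), "of", len(turns), "turns kept")
```

Dropping the oldest turns is only one policy; summarizing older history into a shorter note is another way to spend the same budget.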

By 2026, context windows on frontier models have grown large enough to hold entire books or codebases in a single pass, which changes the design conversation: instead of always chunking documents into tiny fragments, teams can sometimes feed much larger source material directly. But bigger context windows cost more per request and can still suffer from the model paying less attention to content buried in the middle of a very long prompt — so retrieval and summarization remain relevant skills, not obsolete ones.

Embeddings and Vector Search, Conceptually

An embedding is a numeric representation of text (or an image, or audio) — a list of numbers, a vector, positioned in a high-dimensional space such that semantically similar items end up near each other. “Puppy” and “dog” land close together; “puppy” and “spreadsheet” land far apart.

This is the trick behind semantic search: instead of matching exact keywords, you convert a search query into an embedding and find the stored documents whose embeddings are nearest to it — even if they don’t share a single word in common.

                    "How do I reset my password?"
                                │
                        [ Embed the query ]
                                │
                                ▼
                     Vector: [0.12, -0.87, 0.44, ...]
                                │
                 ┌──────────────┼──────────────┐
                 ▼              ▼              ▼
          Doc: "Account     Doc: "Login      Doc: "Shipping
          recovery steps"   troubleshooting"  policy FAQ"
          distance: 0.04     distance: 0.09    distance: 0.91
                 │              │
             ┌───┴──────────────┘
             ▼
     Closest matches returned to the model as context
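
The lookup in the diagram can be sketched as a brute-force nearest-neighbor search. The code below assumes cosine distance as the measure and uses made-up three-dimensional vectors; real embeddings come from an embedding model and have hundreds or thousands of dimensions. The function names are hypothetical.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: near 0 for same direction, 1 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(
    query_vec: list[float], index: dict[str, list[float]], top_k: int = 2
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and return the closest."""
    scored = [(cosine_distance(query_vec, vec), name) for name, vec in index.items()]
    scored.sort()  # smallest distance first
    return [(name, dist) for dist, name in scored[:top_k]]


if __name__ == "__main__":
    # Toy vectors invented for illustration only.
    index = {
        "Account recovery steps": [0.9, 0.1, 0.0],
        "Login troubleshooting": [0.8, 0.3, 0.1],
        "Shipping policy FAQ": [0.0, 0.2, 0.9],
    }
    query = [1.0, 0.1, 0.0]  # stands in for the embedded password question
    for name, dist in nearest(query, index):
        print(f"{name}: distance {dist:.3f}")
```

Scanning every vector is fine for a handful of documents. At millions of vectors, the job of a vector index is to find the same neighbors without comparing against every entry.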

A vector database (or vector index) stores millions of these embeddings and can retrieve the nearest neighbors fast. On AWS, this capability shows up inside Bedrock Knowledge Bases, in OpenSearch Service’s vector engine, and in vector capabilities added to several managed database services — the exam wants you to recognize the pattern conceptually more than memorize every product name that supports it.

What Generative AI Is Actually Good At

Summarization — condensing long documents, meeting transcripts, or support threads into digestible summaries
Conversational assistants — chatbots and virtual agents that hold context across a conversation
Code generation — drafting functions, explaining unfamiliar code, suggesting fixes
Content creation — marketing copy, product descriptions, drafts of reports
Data extraction and transformation — pulling structured fields out of unstructured text, rewriting content into a different format or tone

Where It Falls Down — Know These Cold

The exam will absolutely test your understanding of generative AI’s limitations, because deploying it responsibly means knowing where it can quietly mislead you.

Hallucination — The model generates plausible-sounding but factually incorrect content, stated with the same confidence as correct content. It isn’t “lying” — it’s predicting likely-sounding text, and likely-sounding isn’t the same as true. This is precisely why RAG and grounding in verified data matter so much for factual use cases.

Bias — Models learn from training data, and training data reflects the biases present in the world and in whoever curated it. A model can reproduce and even amplify stereotypes present in its training corpus unless deliberately mitigated.

Non-determinism — Ask the same model the same question twice and you may get two differently worded (sometimes substantively different) answers, especially with non-zero “temperature” settings that inject randomness into generation. This matters for testing, auditing, and any workflow that assumes reproducibility.
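
To see where the randomness comes from, here is a toy sketch of temperature-scaled sampling over three candidate tokens. The scores and token names are invented for illustration, the function names are hypothetical, and real models sample over a vocabulary of many thousands of tokens. Treating temperature 0 as "always pick the top token" is a simplifying assumption; hosted services may still show small variations at that setting.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(
    tokens: list[str], logits: list[float], temperature: float, rng: random.Random
) -> str:
    """Pick the next token: greedy at temperature 0, random draw otherwise."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


if __name__ == "__main__":
    tokens = ["Paris", "London", "Rome"]  # toy candidates
    logits = [2.0, 1.0, 0.5]              # toy scores
    rng = random.Random()
    for temp in (0, 0.2, 2.0):
        picks = [sample_token(tokens, logits, temp, rng) for _ in range(10)]
        print(f"temperature={temp}: {picks}")
```

At low temperature nearly all the probability sits on the top-scoring token, so repeated runs agree; at high temperature the alternatives get picked often enough that two runs can diverge.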

Lack of true understanding — These models predict statistically likely next tokens; they don’t “reason” the way a human does, even when their output reads like careful reasoning. That’s worth sitting with — output fluency is not proof of correctness.

Exam Focus: What Questions Test From This Step

Defining a foundation model correctly: broad training, general-purpose adaptability, not “just a big model”
Distinguishing pretraining vs. fine-tuning vs. prompting vs. RAG — especially that RAG does not alter model weights
Understanding tokens as the unit models process, and context window as the hard capacity limit on a request
Recognizing embeddings as numeric vectors capturing semantic meaning, and vector search as nearest-neighbor retrieval
Identifying hallucination, bias, and non-determinism as inherent risks, not edge-case bugs
Matching a described business need (chatbot, summarizer, code assistant) to the generative AI use case category it represents

Written by NPBlue Cloud Team: Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy.