Step 2 — Bedrock & RAG

Retrieval-augmented generation gets pitched as simple: “just connect your documents to the model.” Anyone who has actually shipped one of these systems knows the pitch is misleading. The model call is the easy 10%. The other 90% — how you split documents, how you embed them, how you retrieve, how you decide what’s “close enough” to be relevant — is where RAG systems either work reliably or quietly hallucinate their way past a demo and into a support ticket queue. This step is about that 90%.

Why RAG Exists in the First Place

A foundation model’s knowledge is frozen at whatever point its training data ended, and it knows nothing about your internal wiki, your product catalog, or last week’s support tickets. RAG solves this without touching the model’s weights at all: at request time, you retrieve the most relevant pieces of your own data and hand them to the model as context, alongside the user’s question. The model still generates the answer — it’s just generating it with your facts sitting right in front of it instead of guessing from memory.

USER QUESTION
     │
     ▼
[ Embed the question into a vector ]
     │
     ▼
[ Search vector store for nearest matching chunks ]
     │
     ▼
Retrieved chunks ──┐
                    ├──► [ Assembled prompt ] ──► Foundation Model ──► Answer
User question ──────┘
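
The last box in that flow, the assembled prompt, is plain string building. Here is a minimal sketch of it; the function name `assemble_prompt` and the instruction wording are illustrative assumptions, not a Bedrock-prescribed template.

```python
def assemble_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt string."""
    # Number each chunk so the model (and a human debugging) can refer to it.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Hypothetical chunks standing in for real retrieval results.
    retrieved = [
        "Refunds are available within 30 days of purchase.",
        "Refund requests are filed through the billing portal.",
    ]
    print(assemble_prompt("How long do I have to request a refund?", retrieved))
```

Telling the model to admit when the context has no answer is a cheap guard, but it only helps if retrieval is not feeding it irrelevant chunks in the first place.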

Bedrock Knowledge Bases is the managed way to do this on AWS — it handles the ingestion pipeline, chunking, embedding calls, vector store writes, and the retrieval query at answer time, so you’re not hand-rolling all of that plumbing yourself. You still make the design decisions; Bedrock just gives you the wiring.
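
For a sense of how little wiring is left on your side, here is a hedged sketch of a retrieval-only call through boto3’s `bedrock-agent-runtime` client. It assumes boto3 is installed, AWS credentials and region are configured, and a Knowledge Base already exists; the helper names and the ID you pass in are placeholders of ours.

```python
def build_retrieve_request(knowledge_base_id: str, question: str, top_k: int = 5) -> dict:
    """Build keyword arguments for the bedrock-agent-runtime `retrieve` call."""
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": question},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    }


def retrieve_chunks(knowledge_base_id: str, question: str, top_k: int = 5) -> list[dict]:
    """Query the Knowledge Base and return the raw retrieval results."""
    # Imported lazily so the request builder can be used without AWS set up.
    import boto3

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve(
        **build_retrieve_request(knowledge_base_id, question, top_k)
    )
    # Each result carries the chunk text, its source location, and a score.
    return response["retrievalResults"]
```

Keeping the request builder separate from the network call makes the retrieval settings easy to unit test and to log alongside each answer.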

Chunking: The Decision Everyone Underestimates

Before anything gets embedded, your source documents get split into chunks — because embedding an entire 40-page PDF as one vector loses far too much specificity to be useful for retrieval. Chunk size and boundaries determine whether retrieval later finds the right passage or just something vaguely nearby.

Chunk too small — a sentence or two — and you strip away surrounding context the model would need to interpret the passage correctly. Chunk too large — several pages — and you dilute the vector’s meaning across too many unrelated ideas, so a specific question won’t match it as sharply, and you waste tokens feeding the model text it doesn’t need.

Chunking Strategy	How It Works	Good Fit For
Fixed-size	Split every N tokens, often with overlap	Simple, homogeneous documents; fast to set up
Semantic/paragraph-based	Split on natural boundaries (headings, paragraphs)	Structured docs like manuals, policies, FAQs
Recursive/hierarchical	Split by document structure, falling back to size limits	Long, nested documents (contracts, technical specs)
Document-level with summaries	Keep full doc, also index a summary chunk	Cases where whole-document context genuinely matters

Overlap between consecutive chunks — repeating the last sentence or two of one chunk at the start of the next — is a small trick that pays off disproportionately. It stops you from splitting a sentence or an idea exactly at a chunk boundary and losing the thread on both sides.
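
A minimal sketch of fixed-size chunking with overlap follows. To stay dependency-free it counts whitespace-separated words rather than model tokens, which is a simplification; the default sizes are arbitrary placeholders, not recommendations.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1):
        print(chunk)
```

With `chunk_size=4` and `overlap=1`, each chunk starts with the last word of the previous one, so an idea that straddles a boundary appears whole in at least one chunk.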

There’s no universal “right” chunk size. The only way to know if yours is working is to test retrieval against real questions your users actually ask, not against the documents in isolation.
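
One way to make "test against real questions" concrete is a small hit-rate check: for each question, does the chunk you expect show up in the top-k results? The sketch below assumes you can label an expected chunk ID per question and that your retriever can be wrapped as a function taking `(question, k)`; both are assumptions about your setup.

```python
from typing import Callable


def hit_rate_at_k(
    eval_set: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected chunk id appears in the top-k results."""
    if not eval_set:
        return 0.0
    hits = sum(
        1
        for question, expected_id in eval_set
        if expected_id in retrieve(question, k)
    )
    return hits / len(eval_set)


if __name__ == "__main__":
    # Hypothetical canned retriever standing in for a real vector store query.
    canned = {
        "how do refunds work": ["refund-policy", "billing-faq"],
        "reset my password": ["sso-setup", "billing-faq"],
    }

    def fake_retrieve(question: str, k: int) -> list[str]:
        return canned.get(question, [])[:k]

    questions = [
        ("how do refunds work", "refund-policy"),
        ("reset my password", "password-reset"),
    ]
    print(hit_rate_at_k(questions, fake_retrieve, k=2))
```

Re-running the same evaluation set after each chunking change turns "it feels better" into a number you can compare.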

Embeddings and Where the Vectors Live

Each chunk gets converted into an embedding: a vector of numbers positioned so that semantically similar chunks land near each other in that vector space. The embedding model you choose matters. It needs to match the language, domain, and sometimes even the length characteristics of your content. Critically, you generally can’t mix vectors from two different embedding models in one index, since “closeness” is only meaningful within a single model’s coordinate space. Switching embedding models therefore means re-embedding the entire corpus.
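
To make "near each other" concrete, here is a small sketch using cosine similarity, one common closeness measure. The three-dimensional vectors are hand-made toys; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        # Different dimensions usually means different embedding models.
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k vectors most similar to the query (brute force)."""
    ranked = sorted(
        index,
        key=lambda chunk_id: cosine_similarity(query, index[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    toy_index = {
        "refund": [0.9, 0.1, 0.0],
        "shipping": [0.1, 0.9, 0.0],
        "login": [0.0, 0.1, 0.9],
    }
    print(nearest([0.8, 0.2, 0.0], toy_index, k=2))
```

A real vector store replaces the brute-force sort with an approximate nearest-neighbor index, but the notion of closeness is the same.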

Those vectors need somewhere to live that supports fast nearest-neighbor search across potentially millions of entries. On AWS you have real choices here, and they trade off differently:

VECTOR STORE OPTIONS
─────────────────────────────────────────────────────────────
OpenSearch Serverless (vector engine)
  → Purpose-built for search workloads, scales automatically,
    good default when you don't want to manage index sizing

Aurora with pgvector
  → You're already running Postgres, want vectors alongside
    relational data, comfortable managing the instance

Other managed vector-capable stores
  → Chosen when a team already standardizes on that database
    for other reasons and adds vector search as one more feature

If your organization is already running relational workloads on Aurora and the vector search volume is moderate, pgvector keeps everything in one place and one operational model. If you’re building a dedicated, high-throughput retrieval layer decoupled from any existing database, a serverless search-native option removes a lot of capacity-planning work. Neither choice is “the right one” in the abstract — it depends on your team’s existing operational footprint.

Tuning Retrieval Quality

Getting chunking and embeddings right gets you most of the way there, but retrieval quality is really tuned at query time, and there are a few levers worth knowing.

Top-k (how many chunks you retrieve per query) is a tradeoff. Too few and you risk missing the passage that actually answers the question. Too many and you dilute the prompt with marginally relevant content, burn tokens, and give the model more noise to sift through. A common starting range is roughly three to eight chunks, but the right value for your system should be tuned against real evaluation data rather than picked arbitrarily.

Similarity thresholds: rejecting a retrieved chunk if it’s not close enough to the query, rather than always returning your top-k no matter how weak the match. Without a threshold, a question with no good answer in your knowledge base will still return something, and the model may confidently answer from an irrelevant chunk instead of saying “I don’t know.”
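
Top-k and a similarity threshold combine naturally into one selection step. The sketch below assumes scores where higher means more similar; the default `min_score` of 0.5 is a placeholder, since useful cutoffs depend on the embedding model and the score scale your store reports.

```python
def select_chunks(
    scored: list[tuple[str, float]],
    top_k: int = 5,
    min_score: float = 0.5,
) -> list[tuple[str, float]]:
    """Keep at most top_k chunks, and only those scoring at least min_score."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(chunk_id, score) for chunk_id, score in ranked[:top_k] if score >= min_score]


if __name__ == "__main__":
    # Hypothetical (chunk id, similarity) pairs from a vector search.
    results = [("a", 0.91), ("b", 0.42), ("c", 0.73)]
    selected = select_chunks(results, top_k=2, min_score=0.5)
    if not selected:
        print("No sufficiently relevant context; answer 'I don't know'.")
    else:
        print(selected)
```

An empty result is a signal in its own right: it lets the application decline to answer instead of passing weak context to the model.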

Metadata filtering — narrowing the search to a subset of documents based on structured attributes (document type, date, department, access level) before or alongside the vector search. This is often what actually fixes a “wrong answer” bug in practice — not a smarter embedding model, but restricting the search space so an outdated or irrelevant document can’t even be a candidate.
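
Conceptually, a metadata filter is an exact-match gate applied to candidates before similarity ranking. Here is a store-agnostic sketch; the `Chunk` class and the metadata keys are made up for illustration, and a managed store would apply the equivalent filter inside the query rather than in application code.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    text: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(chunks: list[Chunk], required: dict) -> list[Chunk]:
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        chunk
        for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in required.items())
    ]


if __name__ == "__main__":
    corpus = [
        Chunk("p-2023", "Old refund policy text", {"doc_type": "policy", "year": 2023}),
        Chunk("p-2024", "Current refund policy text", {"doc_type": "policy", "year": 2024}),
        Chunk("faq-1", "Billing FAQ text", {"doc_type": "faq", "year": 2024}),
    ]
    # Only current policies stay eligible for the vector search that follows.
    candidates = filter_by_metadata(corpus, {"doc_type": "policy", "year": 2024})
    print([chunk.chunk_id for chunk in candidates])
```

Because the outdated policy never enters the candidate set, no amount of semantic similarity can surface it.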

Re-ranking — running a second, more precise (and more expensive) relevance pass over the initial vector search results before handing the final set to the model. Vector similarity is a fast, approximate first pass; a re-ranker can catch cases where the top vector match isn’t actually the most useful passage for answering the specific question asked.
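
The two-stage shape is easy to sketch: take the first-pass candidates, score each one against the query with a second scorer, and keep the best few. In the sketch below, `term_overlap_score` is a deliberately crude stand-in we made up for illustration; a production re-ranker would typically be a dedicated relevance model.

```python
from typing import Callable


def term_overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: share of query terms that appear in the passage."""
    query_terms = set(query.lower().split())
    passage_terms = set(passage.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & passage_terms) / len(query_terms)


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float] = term_overlap_score,
    top_n: int = 3,
) -> list[str]:
    """Re-order first-pass candidates by a second scorer and keep the top_n."""
    ranked = sorted(candidates, key=lambda passage: score(query, passage), reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    first_pass = [
        "how to change billing address",
        "reset your password from the login page",
        "password rules",
    ]
    print(rerank("reset password", first_pass, top_n=2))
```

Passing the scorer in as a function keeps the pipeline shape fixed while you experiment with cheaper or more accurate re-rankers.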

Hybrid Search: Combining Keyword and Semantic Retrieval

Pure vector search has a blind spot: it’s good at semantic similarity but sometimes weak on exact terms — product SKUs, error codes, specific names — where a keyword match would be far more precise than a semantic one. Hybrid search runs both a traditional keyword (lexical) search and a vector search in parallel, then combines the results with a scoring method that blends both signals.

QUERY: "error code E4471 troubleshooting"
        │
        ├──► Keyword search  ──► exact match on "E4471" (high confidence)
        │
        └──► Vector search   ──► semantically similar troubleshooting docs
                    │
                    ▼
          [ Combined, re-ranked result set ]

This matters more than it sounds like it should. Teams that ship pure-vector RAG systems for technical support content are often surprised when a query containing a specific error code or part number retrieves generically similar-sounding documents instead of the one exact match — hybrid search is the direct fix.
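
The blending step can be sketched with Reciprocal Rank Fusion (RRF), one widely used way to merge ranked lists without having to reconcile their raw score scales. RRF is our choice for illustration here, not necessarily what any particular store uses, and the document IDs are invented.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists; each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_hits = ["doc_e4471", "doc_a"]           # exact match on the code ranks first
    vector_hits = ["doc_b", "doc_e4471", "doc_c"]   # semantically similar documents
    print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

The document containing the exact error code rises to the top because it appears in both lists, even though vector search alone ranked it second.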

RAG Failure Modes Worth Watching For

Even a well-built RAG pipeline degrades in predictable ways. Retrieval can return chunks that are topically related but don’t actually answer the question, and the model may still generate a fluent, confident-sounding answer from them — this is a subtler form of hallucination than “making things up from nothing,” and it’s harder to catch because the output reads as grounded. Stale indexes are another quiet failure: if your ingestion pipeline doesn’t re-embed updated source documents, users get confidently wrong answers based on last quarter’s policy. And retrieval that returns too many low-relevance chunks can bury the one useful passage in noise, especially in longer contexts where positional effects make the model less attentive to content in the middle of the prompt.
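
Stale indexes are the easiest of these to check for mechanically. One hedged approach: store a content hash next to each document at ingestion time, then compare it with the current source before deciding what to re-embed. The function names and the in-memory dictionaries below are illustrative assumptions; in practice the hashes would live in your own metadata store.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(source_docs: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents that are new or changed since they were indexed."""
    return sorted(
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )


if __name__ == "__main__":
    # Hashes recorded the last time each document was embedded.
    indexed = {
        "refund-policy": content_hash("Refunds within 14 days."),
        "shipping": content_hash("Ships in 3 days."),
    }
    # Current state of the source documents.
    current = {
        "refund-policy": "Refunds within 30 days.",  # edited since indexing
        "shipping": "Ships in 3 days.",              # unchanged
        "warranty": "One year warranty.",            # never indexed
    }
    print(find_stale(current, indexed))
```

This only covers new and changed documents; chunks belonging to documents deleted from the source need a separate cleanup pass.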

Key Skills This Step Builds

Designing a chunking strategy matched to document structure rather than defaulting to one fixed size everywhere
Choosing between vector store options (OpenSearch Serverless, Aurora pgvector, and similar) based on operational fit, not novelty
Tuning top-k, similarity thresholds, and metadata filters against real evaluation queries
Recognizing when re-ranking is worth the added latency and cost
Building hybrid search for content where exact terms (codes, names, IDs) matter alongside semantic meaning
Diagnosing RAG failure modes: stale indexes, weakly relevant retrieval, and confident answers built on the wrong chunk

Written by NPBlue Cloud Team: Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy.