
What Is RAG?

How retrieval-grounded systems answer from evidence, and how to tell when that architecture is actually the right fit.

The Basic Idea

The basic problem is access. A model can answer from its training data and from whatever you put in the prompt, but it cannot answer from your policies, PDFs, notes, or records unless the system has some way to reach those materials. Demos that look impressive on a model's training data are much less useful on your own files.

RAG (Retrieval-Augmented Generation) is one common way to provide that access. The system looks up relevant material from a bounded collection and answers with that evidence in view. When people say a system is grounded, they mean its answers are tied to retrievable sources rather than generated from the model's training data alone.

For this series, the practical question is whether the work really calls for a retrieval layer at all. A bounded collection, repeated access, and a need to tie answers back to inspectable documents can make the case quickly. But people often assume RAG requires specialized infrastructure before checking whether a simpler approach would do.

Why Retrieval

The alternative most often weighed against retrieval is retraining or fine-tuning a model (adjusting its internal weights by running it through new data), which is typically expensive and slow, and out of reach for most teams without dedicated ML infrastructure. Retrieval avoids that. When your policies change, you update the document collection rather than retraining the model. The system can also point to specific sources, which means users can verify what it found. And because retrieval and generation are separate steps, you can swap one collection for another without changing the model.

Retrieval can work in two ways. The standard pattern pre-indexes a collection and searches it at query time. The alternative, often more practical in agentic work, lets the model use tools to look things up on demand.

The Standard Pattern

[Diagram: Your Question -> Retrieval -> Generation -> Answer. The Retrieval step draws on Your Document Collection (policies, reports, records, etc.).]

The retrieval step searches your documents for relevant passages, then the generation step produces an answer that draws on those passages.

Most RAG systems follow the same basic sequence, even if the software stack and naming conventions differ:

  1. Ingestion: Your documents are split into passages (often called "chunks"), then processed and stored in a searchable format. The chunk size matters: too large and retrieval becomes imprecise, too small and passages lose context.
  2. Retrieval: When you ask a question, the system finds the most relevant passages from your documents.
  3. Augmentation: Those passages are combined with your question into a prompt for the AI.
  4. Generation: The AI generates an answer, drawing on the retrieved information.
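The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the function names, the word-window chunker, the word-overlap scoring, and the sample policy text are all hypothetical, and generate is a placeholder where a real system would call a language model.

```python
import re


def chunk_words(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Ingestion: split text into overlapping word-window passages."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def tokens(text: str) -> set[str]:
    """Lowercase word set used for the toy relevance score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Retrieval: rank chunks by how many question words they share."""
    q = tokens(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Augmentation: combine the passages and the question into one prompt."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the passages below. Cite passage numbers.\n\n"
        f"{numbered}\n\nQuestion: {question}"
    )


def generate(prompt: str) -> str:
    """Generation: placeholder only. A real system would call a model here."""
    return f"(model output for a prompt of {len(prompt)} characters)"


# Illustrative text, not a real policy document.
policy = (
    "Interlibrary loans can be renewed once for two weeks. "
    "Meeting rooms are booked online and held for fifteen minutes."
)
passages = chunk_words(policy, size=10, overlap=2)
question = "Can I renew interlibrary loans?"
prompt = build_prompt(question, retrieve(question, passages, k=1))
print(generate(prompt))
```

The size and overlap parameters are where the chunk-size tradeoff shows up: larger windows keep more context per passage but make each match less specific.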

How Retrieval Works

You do not need to know the internals. You need to know where retrieval can fail. If the system fetches the wrong passages, the answer can still sound composed even though it is built on the wrong evidence. For anyone with catalog or database experience, the tradeoff will feel familiar. Keyword search is good for exact terms. Vector search is good for concept-level matching. Many practical systems combine the two.

Keyword Search

This is the most familiar approach. The system looks for exact matches to the terms in your query. A search for "dog" will find documents containing "dog" but may miss documents that only say "puppy" or "canine." Keyword search is fast and predictable, and the system's behavior is easy to understand. Its weakness is recall: if you and the document collection use different words for the same idea, the relevant passages do not show up in the results.
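A small sketch makes the vocabulary problem concrete. It assumes the simplest possible matcher (every query term must appear as a whole word, with no stemming or synonyms), and the three sample documents are made up for illustration.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase whole words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(query: str, docs: list[str]) -> list[str]:
    """Return the documents that contain every query term as an exact word."""
    terms = words(query)
    if not terms:
        return []
    return [doc for doc in docs if terms <= words(doc)]


# Hypothetical sample documents.
DOCS = [
    "Our dog training class meets on Tuesdays.",
    "Puppy socialization hour is held in the courtyard.",
    "Canine first aid handouts are at the front desk.",
]

# Only the first document matches; the "puppy" and "canine" ones are missed.
print(keyword_search("dog", DOCS))
```

Real keyword engines usually add stemming and ranking on top of this, but the underlying limitation is the same: a document has to share vocabulary with the query to be found.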

Vector Search (Embeddings)

Vector search tries to match meaning across differently worded passages. The core idea is that text gets converted into a vector (also called an embedding), a long list of numbers that functions like coordinates in a high-dimensional space. If two passages mean similar things, their vectors are close together in that space, even when the wording is different. (If you have worked with subject headings, you can think of it roughly as automated concept collocation, with the same benefits and some of the same hazards.) A machine learning model generates these vectors, and they are stored in a specialized vector database optimized for comparing them. When you search, your query is also converted to a vector, and the system finds passages whose coordinates are closest to yours.
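The idea of closeness can be illustrated with cosine similarity, one common way to compare vectors. The sketch below uses hand-made three-number vectors purely for illustration. Real embeddings come from a model and have hundreds or thousands of dimensions, and the passages and values here are assumptions, not output from any actual embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "embeddings", written by hand. Think of the three axes as roughly
# (animals, youth, finance). These are NOT real model outputs.
PASSAGES = {
    "Puppy training tips": [0.8, 0.6, 0.0],
    "Invoice due dates": [0.0, 0.1, 0.9],
}
QUERY_VECTOR = [0.9, 0.2, 0.0]  # stands in for the embedded query "dog"


def nearest(query: list[float], passages: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the k passages whose vectors are most similar to the query."""
    ranked = sorted(
        passages,
        key=lambda name: cosine_similarity(query, passages[name]),
        reverse=True,
    )
    return ranked[:k]


# The "dog" query lands nearest the puppy passage despite sharing no words.
print(nearest(QUERY_VECTOR, PASSAGES))
```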

The result is that a search for "dog" might also retrieve text about "puppies" or "canines," since the system is matching by meaning as well as wording. Vector search has its own problems. It can return passages that are thematically related without answering the question. And the reason a particular passage was retrieved can be harder to inspect than a keyword match. Many systems combine the two approaches, using keyword matching for precision and vector similarity for broader recall.
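One way to combine the two result lists is reciprocal rank fusion, where each document earns 1 / (k + rank) from every list it appears in. This is offered as one common technique, not necessarily what any particular system uses. The document IDs below are hypothetical, and k = 60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score (highest first); break ties by ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the same query from two retrievers.
keyword_hits = ["policy_a", "policy_b"]
vector_hits = ["policy_b", "faq_c", "policy_a"]

# policy_b ranks well in both lists, so it comes out on top.
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Because the method only looks at rank positions, it does not need the keyword and vector scores to be on the same scale.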

RAG in Practice

Library Policy Chatbot

A library connects an AI to its policy documents, procedures, and FAQs.

Patron asks
"Can I renew interlibrary loans?"
AI responds (with RAG)
"According to our ILL policy (updated January 2025), most interlibrary loans can be renewed once for an additional two weeks. Exceptions include items marked 'no renewal' by the lending library. To request a renewal, contact the ILL desk at least 3 days before the due date."

The answer is tied to the library's actual policy document, so a staff member can check whether it reflects what was adopted in January.

Research Assistant

A researcher connects an AI to their collection of PDFs: journal articles, reports, notes.

Researcher asks
"What methods have been used to measure community engagement in public libraries?"
AI responds (with RAG)
"Based on the articles in your collection: Johnson (2022) used surveys combined with circulation data; the IMLS study (2023) developed a composite index including program attendance, computer use, and meeting room bookings; Chen et al. (2024) proposed qualitative interviews as a complement to quantitative metrics..."

Every citation traces back to a document in the researcher's own collection. A general model response would give you nothing to verify against.

Tool-Based Retrieval

Everything above assumes a pre-indexed collection, which means upfront work: splitting documents, generating embeddings, choosing and configuring a vector database. For a small, stable collection this setup can take an afternoon. Larger or more complex collections can require dedicated engineering time, ongoing maintenance, and nontrivial API costs for embedding. That investment is sensible when the same collection will be searched repeatedly and the documents do not change often.

Systems can also use tools to find information on demand, without processing documents in advance. The grounding goal is the same. The system looks things up live instead of using a pre-built index. This works better when the sources are changing or are spread across different systems.

An Example

When Claude Code helped research this guide, it used live lookup with no vector database in the middle.

This is tool-based retrieval, sometimes called "agentic RAG." There is no indexing pipeline and no vector database. The model decides what to look up next.
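The shape of that loop can be sketched as follows. Everything here is a hypothetical stand-in: the two tools read from an in-memory dictionary rather than a real file system, and scripted_model is a hard-coded rule that plays the part a language model would play when choosing the next lookup.

```python
import re

# Hypothetical in-memory "files" so the sketch runs without any setup.
FILES = {
    "ill_policy.txt": "Most interlibrary loans can be renewed once.",
    "hours.txt": "Open 9 to 5 on weekdays.",
}


def list_files(_: str) -> str:
    return ", ".join(sorted(FILES))


def read_file(name: str) -> str:
    return FILES.get(name, f"no such file: {name}")


TOOLS = {"list_files": list_files, "read_file": read_file}


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def scripted_model(question: str, observations: list[tuple[str, str]]) -> tuple[str, str]:
    """Stand-in for a model: decide the next action from what has been seen."""
    if not observations:
        return "list_files", ""
    if len(observations) == 1:
        # Pick the listed file whose name overlaps most with the question.
        names = observations[0][1].split(", ")
        best = max(names, key=lambda n: len(words(question) & words(n)))
        return "read_file", best
    return "answer", observations[-1][1]


def run(question: str, choose=scripted_model, max_steps: int = 5):
    """Loop: ask for an action, run the tool, record the result, repeat."""
    observations: list[tuple[str, str]] = []
    for _ in range(max_steps):
        action, arg = choose(question, observations)
        if action == "answer":
            return arg, observations
        observations.append((f"{action}({arg!r})", TOOLS[action](arg)))
    return "step limit reached", observations


answer, trace = run("What is the ILL renewal policy?")
print(answer)
for call, result in trace:
    print(call, "->", result)
```

The trace of tool calls is what plays the role of citations here: it records which sources were consulted on the way to the answer.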

When Each Approach Fits

Traditional RAG works well for stable collections you want to search repeatedly: policies, manuals, archives. The upfront setup (indexing pipeline, embedding model, vector database) involves work, but once the index exists the ongoing cost per query is low. Tool-based retrieval tends to fit better when sources are changing, mixed, or distributed across systems, since it skips the indexing pipeline entirely.

When RAG Fits

RAG fits when your answer depends on a specific collection and readers need to see what the answer was based on. A library with stable policy documents and a researcher working from a personal corpus of PDFs share two conditions. The evidence is bounded, and readers need to trace answers back to inspectable sources. Without those conditions, the retrieval layer's overhead is hard to justify.

If the relevant material fits in a prompt, or if the task does not involve retrieval at all (writing, brainstorming, coding), you are adding infrastructure you do not need. For live or rapidly changing information, tool-based retrieval works better, since it can reach current sources without waiting on a re-indexing cycle.
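A rough way to check whether material fits in a prompt is to estimate its token count and compare it against the model's context window. The sketch below assumes about four characters per token, which is only a rule of thumb for English text. The function names, the window size, and the reserved-token figure are illustrative assumptions, so check your model's actual limits and tokenizer before relying on numbers like these.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate. Real tokenizers will give different counts."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_prompt(
    documents: list[str],
    context_window_tokens: int,
    reserved_tokens: int = 2000,
) -> bool:
    """True if all documents fit, leaving room for the question and answer."""
    budget = context_window_tokens - reserved_tokens
    total = sum(estimate_tokens(doc) for doc in documents)
    return total <= budget


# Hypothetical numbers: a short handbook against a 100,000-token window.
handbook = "x" * 120_000  # about 30,000 estimated tokens
print(fits_in_prompt([handbook], context_window_tokens=100_000))
```

If the check passes comfortably, pasting the material into the prompt is usually the simpler starting point.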

Limitations

Grounding Helps, but It Does Not Remove Judgment

Even with retrieval, the model may misread retrieved material or fill gaps with unsupported claims. If the underlying documents disagree, it may blend them into something misleading. Retrieval quality matters. If the system retrieves the wrong passages, the answer is confidently built on the wrong evidence. The state of the collection (how it is organized, how current it is, how internally consistent) determines what the system can find.

Further Reading