How retrieval-augmented generation works

Definition

Retrieval-augmented generation (RAG) lets a language model look up relevant documents in your own knowledge base and answer from them, instead of relying only on what it memorized during training.

At a glance

A retriever finds relevant text; a generator (the language model) writes the answer using it^[2].
Answers are grounded in real source material, so the system can cite where each fact came from^[4].
You update knowledge by changing the documents, with no costly model retraining^[3].
By 2025, roughly 30 to 60 percent of enterprise AI use cases ran on RAG^[1].

How it works

RAG has two phases^[5]. First, at indexing time, your documents are split into chunks, converted by an embedding model into numerical “vectors”, and stored in a vector database such as Pinecone or pgvector^[6]. Then, at question time, the system finds the chunks most relevant to the question, adds them to the prompt, and the model answers from that context^[7]. Retrieval quality matters more than raw model size: better retrievers can lift answer accuracy by 9 to 19 points^[8].
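The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the word-count "embedding" stands in for a real embedding model, the in-memory list stands in for a vector database, and every name here (chunk, embed, build_index, retrieve, build_prompt, the sample file names) is hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text, size=40):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: a sparse word-count vector (assumption: a real
    system would call an embedding model here)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(docs):
    """Phase 1: chunk every document and store (source, chunk, vector)."""
    return [(name, c, embed(c)) for name, text in docs.items() for c in chunk(text)]


def retrieve(index, question, k=2):
    """Phase 2a: rank stored chunks by similarity to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)
    return ranked[:k]


def build_prompt(question, hits):
    """Phase 2b: place the retrieved chunks, labeled by source, before the question."""
    context = "\n".join(f"[{name}] {text}" for name, text, _ in hits)
    return (
        "Answer using only the context below and cite sources.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Hypothetical two-document knowledge base.
docs = {
    "refunds.txt": "Refunds are issued within 14 days of a return request.",
    "shipping.txt": "Standard shipping takes 3 to 5 business days.",
}
index = build_index(docs)
question = "How long do refunds take?"
hits = retrieve(index, question, k=1)
prompt = build_prompt(question, hits)  # this string would be sent to the model
```

Because each chunk carries its source name into the prompt, the model can be asked to cite it, and editing a file in docs followed by a rebuild of the index changes what the system knows without touching the model.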

Where it’s used

Common uses include customer-support and internal knowledge bots, with results limited to what each user is allowed to see^[9]; legal and financial research, where citations are the deliverable^[1]; and code assistants that read a company’s own repositories.
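Limiting results to what each user may see is often handled by filtering chunks on access metadata before ranking. The sketch below assumes each chunk is tagged with the groups allowed to read it; the Chunk class, the allowed_groups field, and the group names are hypothetical, and real vector databases usually expose their own metadata-filter mechanism instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset  # groups permitted to read this chunk


def visible_chunks(chunks, user_groups):
    """Keep only chunks that share at least one group with the user."""
    user_groups = set(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]


chunks = [
    Chunk("Holiday schedule for all staff.", "handbook.md", frozenset({"everyone"})),
    Chunk("Salary bands by level.", "comp.md", frozenset({"hr"})),
]

# A support agent sees only the handbook; an HR user sees both documents.
support_view = visible_chunks(chunks, {"everyone", "support"})
hr_view = visible_chunks(chunks, {"everyone", "hr"})
```

Filtering before retrieval, rather than after the model has answered, keeps restricted text out of the prompt entirely.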

RAG vs fine-tuning

RAG and fine-tuning solve different problems and pair well^[10]. Use RAG when facts change often or you need citations. Use fine-tuning to change the model’s tone, format, or vocabulary, not its facts.

Bottom line

RAG is the default way to build AI over private or fast-changing information: cheaper than retraining, citable, and only as good as the documents and retrieval behind it.
