Definition
Retrieval-augmented generation (RAG) is a technique that retrieves relevant documents from an external knowledge base and feeds them to a large language model at query time.
Key takeaways
- RAG combines a retriever, which fetches relevant text from a knowledge base, with a generator, which is a large language model.
- The retrieved passages are inserted into the prompt so the model can ground its answer in source material it did not see during training.
- Compared with fine-tuning, RAG is cheaper to update, easier to govern, and makes source attribution straightforward because the retrieved passages are explicit.
- By 2025, roughly 30 to 60 percent of enterprise generative-AI use cases relied on RAG[1].
- Common production uses include customer-support assistants, internal knowledge bots, legal and financial research, and code search.
What is RAG?
Retrieval-augmented generation pairs an information-retrieval component with a sequence-to-sequence generator so that a language model conditions its output on documents fetched at query time[2]. The approach was introduced in 2020 by researchers at Facebook AI Research (now Meta AI), University College London, and New York University, who described RAG as a way to combine the model’s parametric memory (the knowledge stored in its weights) with non-parametric memory, an external vector index of source documents[3]. The non-parametric side can be updated without retraining the model, which is the property that makes RAG attractive to organizations whose data changes regularly.
The motivation is straightforward. Standalone language models are trained once on a fixed snapshot of text and then frozen. They have no built-in way to consult new information, no native mechanism for citing sources, and a tendency to invent plausible-sounding statements when asked about facts they do not reliably know. RAG addresses each of those gaps by routing the question through a search step before generation, so the answer is grounded in retrievable, citable text[4].
How does RAG work?
A RAG system runs in two phases: an offline indexing phase and an online retrieval-and-generation phase[5].
In the indexing phase, source documents are split into chunks, each chunk is converted into a numerical vector using an embedding model, and the resulting vectors are stored in a vector store such as Pinecone, Weaviate, Chroma, or pgvector, an open-source extension that adds vector search to Postgres[6]. These stores typically use approximate-nearest-neighbor indexes to return semantically similar chunks in milliseconds across collections of millions of documents[6].
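The indexing steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: `chunk_text`, `toy_embed`, and `build_index` are hypothetical names, the hashed bag-of-words vector stands in for a learned embedding model, and a plain list stands in for a vector database (there is no approximate-nearest-neighbor index here).

```python
import hashlib
import math
import re

DIM = 64  # toy vector size; an assumption for illustration only


def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text):
    """Hashed bag-of-words vector, L2-normalized.

    A stand-in for an embedding model: it captures word overlap,
    not meaning, so it is only useful for showing the data flow.
    """
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(documents):
    """Chunk and embed every document; return one record per chunk."""
    index = []
    for doc_id, text in documents.items():
        for i, chunk in enumerate(chunk_text(text)):
            index.append(
                {"doc_id": doc_id, "chunk_id": i, "text": chunk, "vector": toy_embed(chunk)}
            )
    return index
```

Keeping the document ID and chunk position next to each vector is what later lets the system point back to a source. The chunk size and overlap values are arbitrary here and would normally be tuned per corpus.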
In the online phase, the user’s question is embedded with the same model used during indexing. The system then performs a similarity search to retrieve the top-k most relevant chunks, adds them to the prompt as context, and sends the augmented prompt to the language model[7]. The model generates an answer that draws on both its pre-trained knowledge and the retrieved passages. Because the retrieved chunks are explicit, the system can show users exactly which sources were supplied to the model for a given answer.
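The online phase can be sketched the same way. The example below is a minimal illustration with made-up data: `embed`, `retrieve`, and `build_prompt` are hypothetical helpers, the word-count vector stands in for a real embedding model, and the call to the language model itself is left out.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a sparse word-count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, index, k=2):
    """Embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda rec: cosine(q, rec["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks):
    """Insert the retrieved chunks, numbered and labeled by source, into the prompt."""
    context = "\n".join(
        f"[{i + 1}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Made-up corpus; each chunk is embedded with the same function as the question.
corpus = [
    ("returns.md", "refunds are issued within 14 days of a return request"),
    ("shipping.md", "standard shipping takes 3 to 5 business days"),
    ("warranty.md", "the warranty covers hardware defects for two years"),
]
index = [{"source": s, "text": t, "vector": embed(t)} for s, t in corpus]

question = "how many days until refunds are issued"
top = retrieve(question, index, k=2)
prompt = build_prompt(question, top)
# `prompt` would now be sent to a language model; that call is omitted here.
```

Numbering the chunks in the prompt is one simple way to let the model refer to its sources, and the `source` field on each retrieved chunk is what a user interface would display alongside the answer.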
Retrieval quality matters more than model size in many practical settings, because the generator can only ground its answer in the passages the retriever returns. The 2020 paper introducing Dense Passage Retrieval, which is widely used as the retriever in RAG systems, reported a 9 to 19 percentage-point absolute improvement in top-20 passage retrieval accuracy over the BM25 keyword baseline on open-domain question-answering benchmarks[8].
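To make the metric concrete, the sketch below computes top-k retrieval accuracy for two retrievers and the gap between them in percentage points. It assumes the common open-domain QA convention that a question counts as a hit when any of its top k passages contains an answer string. The function name and all data are made up for illustration and are not results from the paper.

```python
def top_k_accuracy(ranked_passages, answers, k=20):
    """Fraction of questions with an answer-bearing passage among the top k.

    ranked_passages[i] is the ranked passage list for question i;
    answers[i] is the list of acceptable answer strings for question i.
    """
    if not answers:
        return 0.0
    hits = 0
    for passages, gold in zip(ranked_passages, answers):
        if any(any(a.lower() in p.lower() for a in gold) for p in passages[:k]):
            hits += 1
    return hits / len(answers)


# Made-up answers and retrieval runs for four questions.
answers = [["paris"], ["1969"], ["oxygen"], ["tolstoy"]]

keyword_run = [
    ["Paris is the capital of France.", "France borders Spain."],
    ["The Moon orbits Earth.", "Apollo was a NASA program."],
    ["Plants absorb carbon dioxide.", "Plants release oxygen."],
    ["War and Peace is a long novel.", "Russia is large."],
]
dense_run = [
    ["Paris is the capital of France.", "France borders Spain."],
    ["Apollo 11 landed on the Moon in 1969.", "The Moon orbits Earth."],
    ["Plants absorb carbon dioxide.", "Plants release oxygen."],
    ["War and Peace is a long novel.", "Anna Karenina was published in 1878."],
]

keyword_acc = top_k_accuracy(keyword_run, answers, k=2)  # 2 of 4 questions hit
dense_acc = top_k_accuracy(dense_run, answers, k=2)      # 3 of 4 questions hit
gain_points = (dense_acc - keyword_acc) * 100            # absolute gain in percentage points
```

An absolute improvement is the plain difference between the two accuracies, so moving from 50 to 75 percent is a 25-point gain rather than a 50 percent relative gain.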
Examples
Production deployments of RAG span several recurring patterns. Customer-support assistants use RAG to answer questions from product documentation and past support tickets, so the response reflects the current state of the product rather than a model snapshot from months earlier[1]. Internal knowledge bots retrieve from HR policies, engineering wikis, and meeting notes, and can restrict what is returned based on the requesting user’s access rights[9]. Legal and financial teams use RAG to query large bodies of regulation, filings, and case law, where the citation trail is itself part of the deliverable[1]. Code assistants retrieve from a company’s own repositories so the model can write code that uses internal libraries it would otherwise know nothing about.
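The access-rights pattern used by internal knowledge bots can be sketched as a filter applied before ranking. This is a minimal illustration, assuming each chunk carries a set of groups allowed to read it. The `Chunk` class, the group names, and the word-overlap ranking (a stand-in for vector similarity) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset  # groups permitted to read this chunk


def visible_chunks(chunks, user_groups):
    """Keep only chunks that share at least one group with the user."""
    user_groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]


def retrieve_for_user(question, chunks, user_groups, k=3):
    """Filter by access rights first, then rank what remains.

    Ranking uses naive word overlap as a stand-in for vector similarity.
    """
    q_words = set(question.lower().split())
    candidates = visible_chunks(chunks, user_groups)
    ranked = sorted(
        candidates,
        key=lambda c: len(q_words & set(c.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


# Made-up chunks and groups.
chunks = [
    Chunk("parental leave policy is 16 weeks", "hr/leave.md", frozenset({"hr", "all-staff"})),
    Chunk("salary bands for engineering levels", "hr/comp.md", frozenset({"hr"})),
    Chunk("deploy service with the release pipeline", "eng/deploy.md", frozenset({"engineering"})),
]

engineer_results = retrieve_for_user(
    "what are the salary bands", chunks, {"engineering", "all-staff"}
)
hr_results = retrieve_for_user("what are the salary bands", chunks, {"hr"})
```

Filtering before ranking means restricted text never reaches the prompt and does not take up one of the top-k slots. In this sketch the engineer's query never sees `hr/comp.md`, even though it is the best textual match.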
RAG vs fine-tuning
RAG and fine-tuning solve different problems and are often used together[10]. RAG injects external knowledge at query time without changing the model; fine-tuning adjusts the model’s weights using examples from a specific domain[10]. RAG is the better fit when the underlying information changes often, when source attribution is required, or when the same model needs to serve multiple knowledge bases. Fine-tuning is the better fit when the goal is to change the model’s behavior — its tone, output format, or grasp of domain vocabulary — rather than its facts.
For most enterprise deployments the practical sequence is to start with RAG, observe how the system is actually used, and reach for fine-tuning only for the highest-value workflows where behavior change justifies the additional cost and complexity.
Bottom line
RAG is the default architecture for building language-model applications over private or fast-changing information. It is cheaper than retraining, supports citations, and adapts as the underlying corpus changes. The hard parts are no longer the language model: they are the quality of the source documents, the chunking and embedding strategy, and the retrieval system that decides what the model gets to see.
Citations
- [1] Enterprise RAG Predictions for 2025. Eva Nahari. Vectara. https://www.vectara.com/blog/top-enterprise-rag-predictions
- [2] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. arXiv. https://arxiv.org/abs/2005.11401
- [3] Retrieval Augmented Generation: Streamlining the creation of intelligent natural language processing models. Meta AI. https://ai.meta.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/
- [4] What is retrieval-augmented generation? Kim Martineau. IBM Research. https://research.ibm.com/blog/retrieval-augmented-generation-RAG
- [5] Build a Retrieval Augmented Generation (RAG) App. LangChain. https://docs.langchain.com/oss/python/langchain/rag
- [6] Vector Databases for RAG: Comparing pgvector, Pinecone, Chroma, and Weaviate. CallSphere. https://callsphere.ai/blog/vector-databases-rag-pgvector-pinecone-chroma-weaviate
- [7] Retrieval Augmented Generation. Amazon SageMaker AI Developer Guide. Amazon Web Services. https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-customize-rag.html
- [8] Dense Passage Retrieval for Open-Domain Question Answering. Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. arXiv. https://arxiv.org/abs/2004.04906
- [9] What is Retrieval-Augmented Generation (RAG)? Amazon Web Services. https://aws.amazon.com/what-is/retrieval-augmented-generation/
- [10] RAG vs. fine-tuning. IBM. https://www.ibm.com/think/topics/rag-vs-fine-tuning