Retrieval-Augmented Generation (RAG)
RAG is the pattern where, before answering a question, the system retrieves relevant text from an external corpus and gives it to the model as context. The model answers from the retrieved chunks rather than from training memory. Most production AI products that "know about your documents" are RAG.
A language model's training data is fixed. It does not know about your company's policies, your codebase, last week's news, or anything specific to a private corpus. RAG is the standard fix: maintain your own document store, retrieve relevant chunks at query time, stuff them into the prompt as context, and let the model answer with the retrieved material grounding the response.
The pipeline:
- Index time: chunk your corpus into 200-1000 token pieces, embed each chunk, store the embeddings in a vector database.
- Query time: embed the user's question, find the nearest k chunks by cosine similarity, retrieve them.
- Generation: build a prompt that includes "Here is relevant context:" + the retrieved chunks + "Answer the user's question using this context."
- Optional: have the model cite the chunks it used, so users can verify.
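The pipeline above can be sketched end to end. This is a toy illustration, not production code: the bag-of-words `embed` function is a stand-in for a real embedding model, and the in-memory list is a stand-in for a vector database. The corpus sentences and the `retrieve`/`build_prompt` names are invented for the example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a learned embedding model: a bag-of-words vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Index time: chunk the corpus and embed each chunk.
corpus = [
    "Refunds are issued within 14 days of purchase.",
    "Support is available Monday through Friday.",
    "All laptops ship with a three-year warranty.",
]
index = [(chunk, embed(chunk)) for chunk in corpus]

# Query time: embed the question and take the nearest k chunks.
def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Generation: stuff the retrieved chunks into the prompt as context.
def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return (
        "Here is relevant context:\n"
        f"{context}\n"
        "Answer the user's question using this context.\n"
        f"Question: {question}"
    )
```

The same shape survives swapping in a real embedding model and vector store: only `embed` and the storage behind `retrieve` change.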
When RAG is done well, the model's hallucination rate drops sharply on factual queries. The model is no longer guessing from memory; it is reading from documents you just handed it. The classic failure mode (confident citation of papers that do not exist) largely disappears when the system is forced to ground in a real corpus.
When RAG is done poorly, the failure modes are:
- Bad retrieval: the wrong chunks come back, so the model grounds its answer in irrelevant text.
- Lost in the middle: the model ignores chunks placed in the middle of a long context.
- Over-confident integration: the model paraphrases the chunks into something they do not actually say.
The fixes for poor RAG are mostly engineering: better chunking strategies, better embeddings, reranking models that re-score retrieved candidates, and prompt patterns that force the model to quote its sources. RAG quality is determined as much by the retrieval engineering as by the language model on top.
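One of those fixes, reranking, can be sketched as a second scoring pass over the first-stage candidates. A real reranker is a cross-encoder model that scores each (query, chunk) pair jointly; the Jaccard-overlap score below is only a stand-in to show where reranking sits in the pipeline, and all function names here are invented for the example.

```python
import re

def _terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def first_stage_retrieve(query: str, corpus: list[str], k: int = 10) -> list[str]:
    # Placeholder for vector search: in practice this is the nearest-k
    # lookup in the vector database.
    return corpus[:k]

def rerank_score(query: str, chunk: str) -> float:
    # Stand-in for a cross-encoder: Jaccard overlap of query and chunk terms.
    q, c = _terms(query), _terms(chunk)
    return len(q & c) / len(q | c) if q | c else 0.0

def retrieve_with_rerank(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # Cheap first stage casts a wide net; the reranker re-scores and trims.
    candidates = first_stage_retrieve(query, corpus, k=10)
    return sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)[:k]
```

The design point is the two-stage shape: a cheap, scalable retriever produces many candidates, and a slower, more accurate scorer decides which few actually reach the prompt.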
Modern alternatives include very-long-context models (just put the whole document in the prompt), tool-use patterns (let the model call a search API itself), and hybrid approaches. RAG is no longer the only answer, but for corpora bigger than a single context window it remains the default.
Related concepts