Retrieval-augmented generation (RAG) is a technique that lets a large language model look things up before it answers. Instead of relying only on what it memorised during training, the model retrieves relevant text from a knowledge source you control (your documents, a database, a policy library), then writes its answer using that retrieved material as evidence. The result is more current, more specific, and easier to fact-check, because you can trace each answer back to a source.
What does RAG actually mean in plain English?
Think of a standard language model as a very well-read colleague answering from memory. They know a lot, but their knowledge is frozen at their training cut-off, and they will occasionally state something confidently that is simply wrong.
RAG hands that colleague a filing cabinet. Before answering, they pull the relevant folder, read it, then reply based on what is actually in front of them. The model still writes the answer in its own words. It just grounds that answer in retrieved facts rather than memory alone.
The term comes from a 2020 paper by Patrick Lewis and colleagues, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", presented at NeurIPS 2020. They paired a text-generating model with a retriever that searched a document index (a vector index of Wikipedia). The paper reported that RAG set the state of the art on three open-domain question-answering tasks and produced "more specific, diverse and factual language" than a generation-only baseline.
How does RAG work step by step?
Most production RAG systems in 2026 follow the same four stages. AWS sets these out clearly in its RAG explainer, and the pattern is near-universal across vendors.
1. Prepare your data
Your source documents (contracts, product manuals, support tickets, past proposals) are split into chunks and converted into numerical representations called embeddings. These embeddings are stored in a vector database, which is built to find text by meaning rather than by exact keyword match.
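As a rough illustration of the chunking part of this stage, here is a minimal Python sketch. The function name `chunk_words`, the word-based window, and the `size` and `overlap` values are all assumptions made for the example; real pipelines often split on tokens, sentences or document headings instead, and the embedding call is left out because it depends on the model you choose.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words; each window repeats the
    last `overlap` words of the previous one so context is not cut mid-thought."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical 12-word "manual" so the windows are easy to inspect.
manual = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(manual, size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap is a design choice: it costs some storage, but it lowers the chance that a sentence straddling two chunks is lost to retrieval.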
2. Retrieve the relevant bits
When a user asks a question, that question is also turned into an embedding. The system searches the vector database for the chunks that sit closest to the question in meaning, and pulls back the top handful. This is the "retrieval" step, and with a purpose-built index it typically takes milliseconds.
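The ranking logic can be sketched in a few lines of Python. This is a toy under stated assumptions: `embed` here is only a word-count stand-in for a real embedding model (so it matches on shared words, not meaning), and the example chunks are invented. A real vector database would also use an approximate nearest-neighbour index rather than scoring every chunk.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lower-cased word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(question: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Score every chunk against the question and keep the k closest."""
    q = embed(question)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Invented example chunks.
chunks = [
    "A refund is available within 30 days of purchase.",
    "Our office is closed on bank holidays.",
    "Refund requests need the original order number.",
]
results = top_k("Can I get a refund after 30 days?", chunks, k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

Swapping `embed` for a call to a real embedding model leaves `cosine` and `top_k` unchanged in principle, which is why the retrieval stage is usually described independently of the model behind it.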
3. Augment the prompt
The retrieved chunks are bundled together with the original question into a single, larger prompt. In effect the model is told: "Here is the question, and here is the relevant material. Answer using this." That is the "augmented" part.
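A minimal sketch of that bundling step, in Python, might look like the following. The function name `build_prompt`, the instruction wording and the source id `refund-policy.md#2` are hypothetical; each team tunes the template to its own model and house style.

```python
def build_prompt(question: str, retrieved: list[tuple[str, str]]) -> str:
    """Combine the question with retrieved (source_id, text) pairs into one prompt."""
    context = "\n".join(f"[{source_id}] {text}" for source_id, text in retrieved)
    return (
        "Answer the question using only the sources below. "
        "Cite the source id in square brackets after each claim. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunk with a made-up source id.
prompt = build_prompt(
    "How long do customers have to request a refund?",
    [("refund-policy.md#2", "Refunds are available within 30 days of purchase.")],
)
print(prompt)
```

Keeping the source id next to each chunk is what makes citation possible later: the model can only point to a source it was shown a label for.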
4. Generate the answer
The model writes its response using the supplied material as the primary evidence. Because the source chunks travelled with the prompt, a well-built system can cite exactly which document each claim came from, so a human can check it.
Why do businesses use RAG instead of just asking the model?
A plain language model answers from a fixed snapshot of training data. It cannot see your internal documents, and it has no knowledge of anything that happened after its cut-off. Ask it about your refund policy and it is likely to invent something plausible. RAG closes that gap by putting your real, current information in front of the model at the moment it answers.
The other draw is trust. Because RAG grounds answers in retrieved passages, the system can show its working and link back to the source. A 2025 peer-reviewed review of hallucination mitigation, published in the journal Mathematics, found that grounding responses in retrieved evidence is the leading practical method for reducing the rate at which language models fabricate facts.
The original 2020 RAG model set the state of the art on three open-domain question-answering benchmarks and generated more factual language than a comparable generation-only model, according to Lewis et al., NeurIPS 2020.
There is a cost angle too. Retraining a model on your own data is expensive and has to be repeated every time the data changes. RAG leaves the model untouched and simply updates the document index, so keeping the system current is a matter of re-indexing files, not retraining a neural network.
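One way to keep re-indexing cheap is to re-embed only the documents whose content has changed. The Python sketch below illustrates that idea with content hashes; the function names, document ids and texts are invented for the example, and the actual embedding and deletion calls are omitted because they depend on the vector store in use.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to detect whether a document has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    documents: dict[str, str], indexed: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare current documents with the hashes stored at last indexing.

    Returns (ids to embed again, ids to delete from the index)."""
    to_embed = sorted(
        doc_id
        for doc_id, text in documents.items()
        if indexed.get(doc_id) != fingerprint(text)  # new or changed
    )
    to_delete = sorted(doc_id for doc_id in indexed if doc_id not in documents)
    return to_embed, to_delete


# Hashes recorded the last time the (hypothetical) index was built.
indexed = {
    "pricing": fingerprint("Plan A costs 10 pounds."),
    "returns": fingerprint("Returns within 30 days."),
    "old-faq": fingerprint("Retired page."),
}
# The documents as they stand today.
documents = {
    "pricing": "Plan A costs 12 pounds.",   # changed
    "returns": "Returns within 30 days.",   # unchanged
    "shipping": "Ships in 2 days.",         # new
}
to_embed, to_delete = plan_reindex(documents, indexed)
print(to_embed, to_delete)
```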
What is the difference between RAG, fine-tuning, and prompting?
These three approaches get confused constantly, and teams often reach for the wrong one. They are not rivals so much as different tools. Prompting changes how you ask. RAG changes what the model can see. Fine-tuning changes how the model behaves. Here is the practical comparison.
| Dimension | Prompt engineering | RAG (retrieval-augmented generation) | Fine-tuning |
|---|---|---|---|
| What it changes | The instructions you give the model | The information the model can draw on at answer time | The model's own internal weights |
| Best for | Formatting, tone, simple task framing | Answering from private or fast-changing knowledge | Durable behaviour: house style, strict output formats |
| Handles current or private data | Only what you paste into the prompt | Yes, retrieved live from your sources | No, only what was in the training set |
| Can cite its sources | No | Yes, traceable to retrieved documents | No |
| Relative cost to set up | Lowest, no infrastructure | Moderate: vector database plus retrieval pipeline | Highest: compute, labelled data, ongoing retraining |
| Effort to keep current | Manual, per prompt | Re-index the documents | Retrain the model |
The order most teams settle on is layered, not either-or. Start with good prompting because it is free. Add RAG when answers need to be grounded in your own knowledge. Fine-tune only when you need the model to reliably behave a certain way that prompting cannot pin down. Plenty of production systems use all three at once: a fine-tuned model, fed retrieved context, driven by a carefully written prompt.
When should a UK business choose RAG?
RAG earns its keep whenever the useful answer lives in your documents rather than in the public internet the model was trained on. A few clear signals that it is the right fit:
Your knowledge changes often. Price lists, policies, product specs and support articles that update monthly are painful to fine-tune around but trivial to re-index. RAG keeps the answers current without touching the model.
You need answers you can audit. In regulated UK sectors, an unverifiable answer is a liability. Because RAG can point to the exact source passage behind each claim, it suits financial services, legal and healthcare-adjacent work where the ICO's guidance on AI and data protection expects you to explain and document automated outputs.
Your data is sensitive or proprietary. RAG lets the knowledge stay in a store you control, retrieved only when needed, rather than being baked into a shared model. That is often the more defensible position for a data protection impact assessment.
The picture matters at national scale too. Per the ONS Business Insights and Conditions Survey, 25% of UK businesses reported using some form of AI in late December 2025, up 15 percentage points from around 10% when the question was first asked in September 2023. Among firms with 250 or more employees, the figure reached 44%.
That works out to an adoption velocity of roughly 6.7 percentage points a year: 15 points over the 27 months between September 2023 and December 2025 (Tom & Co analysis of the ONS 2023 and 2025 figures). Grounding techniques like RAG are among the most common ways those firms make general-purpose models safe to point at their own data.
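The arithmetic behind that figure can be checked in a few lines of Python. The shares are the rounded ONS figures quoted above ("around 10%" and 25%), and counting whole months from September 2023 to December 2025 is an assumption, so the result is an approximation rather than an official statistic.

```python
# Rounded shares of UK businesses reporting AI use, as quoted in the text.
start_share = 10.0  # September 2023, "around 10%"
end_share = 25.0    # late December 2025

# Whole months between the two survey dates (assumed month-level precision).
months_elapsed = (2025 - 2023) * 12 + (12 - 9)  # 27 months

points_per_year = (end_share - start_share) / (months_elapsed / 12)
print(round(points_per_year, 1))  # 6.7
```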
What are the limits and risks of RAG?
RAG is powerful, but it is not a guarantee of correctness. It reduces hallucination; it does not abolish it. The model can still misread a retrieved passage, or the retrieval step can surface the wrong chunk, in which case the answer is confidently built on the wrong material.
Retrieval quality is the whole ballgame. If your documents are messy, out of date, or poorly chunked, RAG will faithfully ground answers in bad source material. "Garbage in, grounded garbage out" is the failure mode teams underestimate most. The work of curating and maintaining the knowledge base is the real project, not the model itself.
There is also an access-control dimension. A RAG system can only be as safe as the permissions on the documents it retrieves. If the index contains material a given user should not see, the model may surface it. Getting document-level permissions right in the vector store is a genuine security task, not an afterthought.
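One common pattern is to filter by permission before anything is ranked or placed in the prompt, so that restricted text never reaches the model for that user. The Python sketch below shows the idea in plain data structures; the `Chunk` class, the group names and the file paths are hypothetical, and many vector stores offer their own metadata filters whose exact syntax varies by product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset[str]  # groups permitted to read the source document


def visible_chunks(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks the user may read; run this BEFORE similarity ranking."""
    return [c for c in chunks if c.allowed_groups & user_groups]


# Invented index contents.
index = [
    Chunk("Annual leave is 25 days.", "hr/leave.md", frozenset({"all-staff"})),
    Chunk("Salary bands are reviewed each April.", "hr/pay-bands.md", frozenset({"hr"})),
]

staff_view = visible_chunks(index, {"all-staff"})
hr_view = visible_chunks(index, {"all-staff", "hr"})
print([c.source for c in staff_view])
print([c.source for c in hr_view])
```

Filtering before ranking, rather than asking the model to withhold restricted content afterwards, means a prompt-level mistake cannot leak a document the user was never allowed to retrieve.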
What should a UK leader do next?
Start small and concrete. Pick one high-volume question your team answers from documents (a support query, an internal policy lookup, a proposal-writing task) and prototype a RAG assistant against just that document set. You will learn more from one narrow working system than from a broad plan.
Get the data house in order first. Decide which documents are the source of truth, who owns keeping them current, and who is allowed to see what. Those three answers determine whether a RAG project succeeds long before any model is chosen.
Then measure it honestly. Track how often the assistant cites the right source and how often a human has to correct it, and compare that against the manual process it replaces. If it saves real time and the answers are traceable, expand it. If not, the problem is almost always the knowledge base, not the model.
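Those two measures are simple to compute once a human reviewer has labelled a sample of answers. The Python sketch below assumes a hypothetical `Review` record with one flag per measure; the four sample reviews are invented to show the calculation, not real results.

```python
from dataclasses import dataclass


@dataclass
class Review:
    cited_right_source: bool  # reviewer confirmed the cited passage supports the answer
    needed_correction: bool   # reviewer had to fix the answer before it was usable


def summarise(reviews: list[Review]) -> dict[str, float]:
    """Share of answers citing the right source, and share needing human correction."""
    total = len(reviews)
    if total == 0:
        raise ValueError("no reviews to summarise")
    return {
        "citation_accuracy": sum(r.cited_right_source for r in reviews) / total,
        "correction_rate": sum(r.needed_correction for r in reviews) / total,
    }


# Invented sample of four reviewed answers.
sample = [
    Review(cited_right_source=True, needed_correction=False),
    Review(cited_right_source=True, needed_correction=False),
    Review(cited_right_source=False, needed_correction=True),
    Review(cited_right_source=True, needed_correction=True),
]
metrics = summarise(sample)
print(metrics)
```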



