RAG
RAG vs Fine-Tuning vs Long Context: How to Ground Your AI
By Niall · 8 min read
Retrieval, fine-tuning and long context are not rivals; they solve different problems, and the best systems use all three.
A general-purpose model knows a great deal about the world and almost nothing about your business. The work of making it useful, getting it to answer from your documents, in your style, about your products, is called grounding, and there are three main ways to do it: retrieval, fine-tuning, and long context.
They are often presented as rivals. In practice they solve different problems, and most production systems use more than one. Understanding what each is actually good at saves you from the common mistake of reaching for the expensive option when a simpler one would do.
RAG: retrieval at query time
Retrieval-augmented generation, RAG, keeps your knowledge in a searchable store and pulls the relevant pieces in at the moment of the question. The model answers using those retrieved passages, so it can cite its sources and stay current as your documents change. Update a document and the next answer reflects it, no retraining required.
That makes RAG the natural fit for knowledge that changes, policies, prices, product details, documentation, and for any case where you need citations and the ability to say 'I cannot find that'. It is the workhorse of grounded AI for good reason.
RAG also keeps your knowledge auditable. Because the answer points back to a specific passage in a specific document, anyone can check it, and you fix a wrong answer by fixing the source rather than retraining anything. For regulated or high-stakes work, that traceability is often the deciding factor.
Fine-tuning: teaching the model itself
Fine-tuning adjusts the model's own weights by training it on examples. It does not so much add facts as teach behaviour: a consistent format, a particular tone, or a narrow, repetitive task done reliably. If you need every answer to follow a strict structure, or want a small model to handle one specialised job well, fine-tuning earns its place.
What fine-tuning is not good at is fresh, changing facts. Baking knowledge into weights means re-training to update it, which is slow and costly compared with editing a document. Use it to shape how the model behaves, not to store what it should know.
It is also worth saying that fine-tuning has become less necessary than it once was. Capable base models follow instructions well, and a good prompt plus retrieval covers many cases that used to need training. Reach for fine-tuning when prompting genuinely cannot get you the consistency you need, not as a first move.
Long context: just fit it in the prompt
Context windows have grown enormously, with some models, such as Gemini, handling up to around two million tokens. That is large enough to drop entire handbooks or codebases straight into a single prompt and ask questions across all of it, with no retrieval pipeline to build. For one-off analysis of a big document, it is wonderfully simple.
Long context also has practical limits beyond cost. Models can lose track of detail buried in the middle of a very long prompt, so simply having room for two million tokens does not guarantee the model uses all of them well. Treat the window as a tool for whole-document reasoning, not a dumping ground.
The cost and freshness trade-offs
- RAG: cheap per call, easy to keep current, gives citations, but needs a retrieval pipeline built and tuned.
- Fine-tuning: great for style and narrow tasks, but expensive to update and poor for changing facts.
- Long context: no pipeline and simple to start, but you pay for every token every time and latency rises with size.
There is no universal winner. The right choice falls out of three questions: how often your data changes, how much you can spend per call, and how much accuracy and traceability the task demands.
Notice that these are engineering trade-offs, not a ranking. The same system can be cheap and current in one place and slow and thorough in another, depending on which technique you applied where. Naming the constraint that matters most for each task is what turns the choice from a debate into a decision.
Why production systems combine them
In real systems these techniques are partners, not competitors. A common pattern is RAG for current, citable knowledge, a fine-tuned model to keep the format and tone consistent, and long context used selectively when a single large document needs to be read whole. Each covers the others' weaknesses.
For example, a support assistant might retrieve the relevant policy with RAG, rely on fine-tuning to answer in your house style, and occasionally pull a full contract into context when a question spans the entire document. The art is matching each technique to the part of the problem it handles best.
The mistake is treating the choice as a one-time architectural decision. In practice you start with the simplest technique that could work, usually RAG, measure where it falls short, and add fine-tuning or long context only to close the specific gaps you can actually see.
How to choose
- Knowledge changes often and you need citations: start with RAG.
- You need a consistent format or a narrow, repeated task done well: consider fine-tuning.
- You occasionally need to reason over one big document: use long context for that step.
- Most of the time: combine them, led by your data, your budget and your accuracy needs.
Grounding is the difference between a model that sounds confident and one that is actually right about your business. Choosing well among retrieval, fine-tuning and long context, and combining them sensibly, is most of what makes a chatbot or assistant trustworthy. Designing that grounding for your data and budget is exactly what we do when we build AI chatbots.
Relevant services

