What is RAG?
RAG (Retrieval-Augmented Generation) is a method that improves Large Language Models (LLMs) by giving them access to external knowledge. Instead of relying only on what the model was trained on, we can “augment” it with facts from our own documents or databases.
RAG has two main components:
- Retriever – Finds relevant information from your knowledge base.
- Generator (LLM) – Uses that information to generate the final answer.
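Conceptually, answering a query is just these two components composed. Here is a minimal Python skeleton (the `retrieve` and `generate` names are placeholders for this post, not any specific library's API; concrete sketches of each follow below):

```python
def answer(query: str) -> str:
    chunks = retrieve(query)         # 1. Retriever: find relevant text in the knowledge base
    return generate(query, chunks)   # 2. Generator: LLM writes the answer from that text
```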
What does the Retriever return?
Here’s the critical part:
- Inside the vector database, documents are stored as embeddings (vectors).
- When a query comes in, the system also converts the query into a vector and finds the most similar matches.
- But the retriever does not pass vectors to the generator.
- Instead, it passes the matching text chunks (the actual text from your documents).
So the generator always works with text, never raw vectors.
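Here is a minimal retriever sketch, assuming the `sentence-transformers` package and `numpy` (the model name is just an example embedding model). Notice that the similarity search runs on vectors, but the return value is text:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

chunks = [
    "RAG combines a retriever with a generator.",
    "The retriever searches a vector database.",
    "Paris is the capital of France.",
]
# Documents are stored as embeddings (the "vector database" of this toy example).
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ query_vector   # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]      # indices of the most similar chunks
    return [chunks[i] for i in top]         # return the TEXT, not the vectors

print(retrieve("What does a RAG retriever do?"))
```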
How does the Generator use it?
There are two common types of generation models:
- Encoder–Decoder models (e.g., T5, BART)
  - Retrieved text + query are fed into the encoder.
  - The decoder then generates the final answer.
- Decoder-only models (e.g., GPT, LLaMA, Mistral)
  - Retrieved text chunks are simply inserted into the prompt along with the query.
  - The single decoder handles both context understanding and answer generation.
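To make the difference concrete, here is a hedged sketch using the Hugging Face `transformers` pipeline API (the model names are small examples, not recommendations, and `retrieve` is the sketch from earlier):

```python
from transformers import pipeline

query = "What is RAG?"
retrieved_chunks = retrieve(query)    # text chunks from the retriever sketch above
context = " ".join(retrieved_chunks)

# Encoder–decoder (T5-style): retrieved text + query are fed into the encoder;
# the decoder generates the final answer.
t5 = pipeline("text2text-generation", model="google/flan-t5-small")
ed_answer = t5(f"question: {query} context: {context}",
               max_new_tokens=64)[0]["generated_text"]

# Decoder-only (GPT-style): retrieved text is simply inserted into the prompt.
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\nAnswer:"
)
gpt = pipeline("text-generation", model="gpt2")  # toy stand-in for any decoder-only LLM
do_answer = gpt(prompt, max_new_tokens=64)[0]["generated_text"]
```

In both cases the model's input is plain text; the embeddings never leave the retriever.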
Visual Diagram of RAG
[Diagram: a side-by-side view of the encoder–decoder and decoder-only RAG pipelines]
Key Takeaways
- Retriever searches with vectors but returns text.
- Generator needs text, not embeddings.
- In encoder–decoder models, text goes into the encoder first.
- In decoder-only models, text is directly added to the prompt.
In short:
👉 User Query → Vector Search → Retrieve Text → Give Text to LLM → Generate Answer
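Stitched together from the sketches above, that arrow chain is just a few lines:

```python
def rag_answer(query: str) -> str:
    chunks = retrieve(query)      # vector search, returns text chunks
    context = " ".join(chunks)    # give the text (not vectors) to the LLM
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return gpt(prompt, max_new_tokens=64)[0]["generated_text"]
```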