What is RAG and why is it needed?

This article tries to answer two questions: what RAG is and why we even need it. Before we can understand why Retrieval-Augmented Generation matters, we need to understand the fundamental mechanics of Large Language Models — and specifically, where they break down. Let’s start with the fundamental constraint that makes RAG necessary.

Every LLM has a context window — a fixed-size buffer that represents everything the model can “see” at any given moment. This includes the system prompt, the user’s input, any injected context, and the model’s own output. All of it must fit within this window, measured in tokens. A 200K-token window can technically hold a 300-page document. However, as we will see in the next section, “can hold” and “can effectively use” are very different things.
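To make the budget arithmetic concrete, here is a minimal sketch of a context-window check. The 4-characters-per-token ratio is only a common rule of thumb for English text, not a real tokenizer, and the window size and output reserve are illustrative assumptions; a production system would count tokens with the model’s own tokenizer.

```python
# Rough illustration of a context-window budget check.
# ASSUMPTIONS: ~4 characters per token (heuristic only), a 200K-token
# window, and 4K tokens reserved for the model's own output.

CONTEXT_WINDOW = 200_000  # tokens

def approx_tokens(text: str) -> int:
    """Estimate token count using the ~4 chars/token heuristic."""
    return max(1, len(text) // 4)

def fits_in_window(system_prompt: str, user_input: str,
                   injected_context: str,
                   reserved_for_output: int = 4_000) -> bool:
    """Everything the model 'sees' plus its reply must share one window."""
    used = (approx_tokens(system_prompt)
            + approx_tokens(user_input)
            + approx_tokens(injected_context))
    return used + reserved_for_output <= CONTEXT_WINDOW

print(fits_in_window("You are a helpful assistant.",
                     "Summarise our refund policy.",
                     "policy text " * 100))  # → True
```

The point of the sketch is that the window is a shared, fixed budget: every token of injected context is a token unavailable for instructions or output.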

Here is where we arrive at one of the limitations of LLMs: the model does not pay equal attention to all parts of the context window.

Research has consistently demonstrated a phenomenon known as “Lost in the Middle.” When presented with a long context, LLMs exhibit a strong U-shaped attention pattern: they attend carefully to content at the beginning and the end of the input, but information placed in the middle receives significantly less attention and is more likely to be ignored, misinterpreted, or hallucinated over.

The practical implication is severe: you cannot treat the context window as a simple bucket. Dumping an entire knowledge base, a complete case history, or all of your policy documents into the prompt and hoping the model will find the right answer is not a viable strategy.

Here are the real-life consequences of these constraints:

  • Knowledge cutoff (the model’s training data has a fixed end date, so anything newer is simply unknown to it)
  • Hallucinations (without grounding material, the model fabricates plausible-sounding but incorrect content)
  • Attention degradation (the “Lost in the Middle” effect described above)

Retrieval-Augmented Generation

The Core Tension: LLMs need context to give good answers, but they cannot handle unlimited context effectively. We need a mechanism that gives the model exactly the right information at exactly the right time — and nothing more.

This is where Retrieval-Augmented Generation enters the picture.

Retrieval-Augmented Generation, or RAG, is an architectural pattern that solves the context problem by fundamentally changing how we deliver information to an LLM. Instead of pre-loading the model with everything it might need, RAG retrieves only the most relevant pieces of information at the moment a question is asked, then feeds those pieces — and only those pieces — into the model’s context window.

RAG operates in two stages. First, the system retrieves relevant content from a knowledge store. Then, the LLM generates a response grounded in that retrieved content. This two-step pattern — Retrieve, then Generate — is the conceptual core of every RAG implementation.

A RAG system has two phases: an offline preparation phase (indexing) and a real-time query phase (retrieval and generation).

Phase 1: Indexing (Offline)

Before a RAG system can answer any questions, the knowledge base must be prepared. This involves three steps:

First, chunking: the source documents are split into meaningful segments — typically paragraphs, sections, or logical units of 200–500 tokens each.

Second, embedding: each chunk is converted into a numerical vector representation (an “embedding”) that captures its semantic meaning.

Third, storing: these vectors are stored in a vector database, creating a searchable index of semantic meaning across the entire knowledge base.
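The three indexing steps above can be sketched in a few lines. The “embedding” here is a toy bag-of-words hash, and the “vector database” is a plain list; both are stand-ins so the flow is visible, and a real system would use an embedding model and a dedicated vector store.

```python
# Minimal sketch of the offline indexing phase.
# ASSUMPTIONS: a toy word-hash "embedding" instead of a real embedding
# model, and a plain list instead of a vector database.
import hashlib
import math

DIM = 64  # toy vector dimensionality

def chunk(document: str, max_words: int = 80) -> list[str]:
    """Step 1 -- chunking: split into paragraph-sized segments."""
    paras = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks = []
    for p in paras:
        words = p.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks

def embed(text: str) -> list[float]:
    """Step 2 -- embedding: hash each word into a fixed-size unit vector."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Step 3 -- storing: the "vector database" is just (vector, chunk) pairs.
index: list[tuple[list[float], str]] = []

def ingest(document: str) -> None:
    for c in chunk(document):
        index.append((embed(c), c))

ingest("Refunds are issued within 14 days.\n\n"
       "Shipping takes 3-5 business days.")
print(len(index))  # → 2
```

Everything here happens before any user asks a question, which is why this phase can afford slower, batch-oriented processing.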

Phase 2: Retrieval and Generation (Real-Time)

When a user asks a question, the system executes a three-step pipeline:

  • RETRIEVE - Embed the query, search the vector store, return top-K relevant chunks
  • AUGMENT - Inject retrieved chunks into the LLM prompt as grounding context
  • GENERATE - LLM produces an answer strictly based on the provided context

The user’s query is first embedded using the same embedding model that indexed the knowledge base. This query vector is then compared against all stored vectors using similarity search, and the top-K most relevant chunks are returned. These chunks are injected into the LLM’s prompt alongside the original question, with instructions to answer based on the provided context. The LLM then generates a response grounded in actual source material rather than its own parametric knowledge.
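The query-time pipeline described above can be sketched as follows. Cosine similarity stands in for the vector store’s search, the index is assumed to be a list of (vector, chunk) pairs produced during indexing, and the final LLM call is only indicated in a comment since it depends on the provider.

```python
# Minimal sketch of the real-time Retrieve -> Augment -> Generate pipeline.
# ASSUMPTION: `index` is a list of (embedding, chunk_text) pairs built
# offline with the same embedding model used for queries.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

def retrieve(query_vec, index, top_k=3):
    """RETRIEVE: rank every stored chunk by similarity to the query."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

def augment(question: str, chunks: list[str]) -> str:
    """AUGMENT: inject retrieved chunks as grounding context."""
    context = "\n\n".join(chunks)
    return ("Answer strictly from the context below. "
            "If the answer is not there, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

# GENERATE would be a call to the LLM with the augmented prompt, e.g.:
# answer = llm.complete(augment(question, retrieve(embed(question), index)))
```

Note how the grounding instruction in `augment` is part of the pattern: retrieval alone does not reduce hallucination unless the prompt also tells the model to stay within the provided context.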

RAG directly addresses the limitations we identified at the beginning of the article:

  • Beats the context window limit. Instead of stuffing everything into the prompt, you inject only 3–5 highly relevant chunks. A 4,000-token context of focused, relevant content outperforms 100,000 tokens of everything-and-the-kitchen-sink.
  • Avoids the dump zone. With a small, targeted context, the model’s attention is concentrated. There is no “middle” to get lost in when your context is a handful of focused paragraphs.
  • Always current. When you update an article in your knowledge base, the next query retrieves the updated content. No retraining. No fine-tuning. Just update the source and re-index.
  • Reduces hallucinations. The model is explicitly instructed to answer based on the provided context. With grounding material present, it has less reason and less opportunity to fabricate.
  • Works with proprietary data. Your enterprise knowledge never leaves your control and never needs to be embedded into the model’s weights. It lives in your knowledge base and is retrieved on demand.

While the concept of RAG is straightforward, the quality of a RAG implementation depends heavily on a set of design decisions. To mention them without going into too much technical detail:

  • Chunking strategy is paramount. Chunks that are too small lose context and produce fragmented answers. Chunks that are too large introduce noise and waste precious context window space.
  • Hybrid retrieval combines vector similarity search (“what does this mean?”) with traditional keyword search like BM25 (“does this contain these exact words?”). Hybrid approaches consistently outperform either method alone.
  • Re-ranking is an optional but valuable step where a secondary model scores and reorders the initial retrieval results before they are passed to the LLM.
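The hybrid-retrieval idea from the list above can be sketched as a weighted blend of two scores. The keyword score here is a crude term-overlap stand-in for BM25, and the vector scores are assumed to be pre-computed similarities in [0, 1]; a real system would use a proper BM25 implementation and live vector search.

```python
# Sketch of hybrid retrieval: blend a keyword score ("does this contain
# these exact words?") with a semantic vector score ("what does this
# mean?"). ASSUMPTION: term overlap stands in for real BM25 scoring.

def keyword_score(query: str, chunk: str) -> float:
    """Fraction of query terms literally present in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / (len(q_terms) or 1)

def hybrid_search(query, chunks, vector_scores, alpha=0.5, top_k=3):
    """Weighted blend: alpha * semantic + (1 - alpha) * keyword.

    vector_scores[i] is assumed to be the pre-computed similarity
    between the query embedding and chunks[i], scaled to [0, 1].
    """
    scored = [
        (alpha * vector_scores[i] + (1 - alpha) * keyword_score(query, c), c)
        for i, c in enumerate(chunks)
    ]
    scored.sort(reverse=True)
    return [c for _, c in scored[:top_k]]

chunks = ["refund policy for returns", "shipping times overview"]
print(hybrid_search("refund policy", chunks,
                    vector_scores=[0.9, 0.2], top_k=1))
# → ['refund policy for returns']
```

The `alpha` weight is itself a tuning decision: heavier keyword weighting helps with exact identifiers (product codes, error numbers), heavier vector weighting helps with paraphrased questions.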

Pega Knowledge Management and Pega Knowledge Buddy

Pega’s Knowledge Management capability provides the structured content layer that underpins RAG within the Pega ecosystem. At its core it is a managed repository of articles, policies, procedures, FAQs, and other knowledge assets organized by categories and content types.

For architects who have worked with Knowledge Management primarily as a tool for human agents — surfacing help articles in case worker portals, building searchable FAQ libraries — the shift to RAG-powered AI requires a reframing. The Knowledge Management solution is no longer just for people. It is becoming the primary source of truth for AI reasoning.

Knowledge Buddy is Pega’s implementation of RAG — and it goes further than a standard RAG pipeline by adding secure, governed access to knowledge content. For a more comprehensive description, see the official documentation: https://docs.pega.com/bundle/knowledge-buddy/page/knowledge-buddy/implementation/buddy-overview.html. In this article, however, we focus on the RAG capabilities. Under the hood, Knowledge Buddy handles the full RAG pipeline: it chunks your knowledge articles, generates embeddings, stores them in a vector index, and at query time retrieves the most relevant chunks to feed into the LLM for answer generation. Architects retain control over chunking configuration and text preprocessing — the levers that most directly affect retrieval quality.

From an architect’s perspective, Knowledge Buddy abstracts away the complexity of building a RAG pipeline from scratch. You do not need to choose embedding models, configure vector databases, or write retrieval logic. Pega handles the infrastructure. What you do need to focus on is the quality and structure of the content in your knowledge base — because Knowledge Buddy can only retrieve what exists there.

Knowledge Buddy integrates with Pega’s broader GenAI capabilities, including GenAI Coach and Pega Agents. This means that the same Knowledge Base content can serve multiple AI-powered touchpoints across the application, creating a single source of truth that is consistently leveraged throughout the platform. For teams managing content outside of Pega Knowledge, Knowledge Buddy also exposes REST APIs for content ingestion, semantic search, and feedback collection.

Agentic AI and Knowledge Buddy

In an agentic architecture, the LLM does not just answer questions — it reasons, plans, and takes multi-step actions autonomously. It can decompose a complex request into subtasks, decide which tools or data sources to consult, execute actions, evaluate results, and iterate. In Pega’s context, this means AI agents that navigate case workflows, trigger business rules, invoke integrations, and orchestrate decisions across the platform — not just respond to queries.

Here is where the fundamental limitation of LLMs becomes acutely relevant for agentic architectures: LLMs are stateless. Every interaction starts from a blank slate. The model retains no memory between conversations. For a simple question-answering tool, statelessness is manageable. For an AI agent that needs to handle complex, multi-day enterprise processes — cases that evolve over weeks, customer relationships that span years, compliance requirements that build on precedent — the lack of persistent memory is a fundamental constraint. This is where the reframing becomes powerful. Consider Knowledge Buddy not merely as a “search assistant” that helps people find articles, but as the external memory architecture for agentic AI within Pega.

While you have case data that you will give to the agent, there are many use cases where feeding in additional data that resides outside your cases is useful. This is where the integration between an agent and Knowledge Buddy comes in. The AI agent uses Knowledge Buddy to retrieve relevant knowledge from the Knowledge Base on demand. In practice, the agent invokes Knowledge Buddy as a tool call — a pattern consistent with how modern agentic frameworks operate. The context window is the agent’s working memory: small, focused, and ephemeral. The Knowledge Base, accessed via Knowledge Buddy, is the agent’s long-term memory.
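The tool-call pattern can be sketched generically. The tool name, the `ask_knowledge_buddy` wrapper, and the registry shape are all illustrative inventions for this sketch — Pega’s actual agent and tool wiring differs — but the flow is the same: working memory stays small while long-term knowledge is fetched on demand.

```python
# Sketch of an agent invoking Knowledge Buddy as a tool.
# ASSUMPTIONS: the tool name, function, and registry below are
# hypothetical; they only illustrate the external-memory pattern.

def ask_knowledge_buddy(question: str) -> str:
    """Hypothetical wrapper around a Knowledge Buddy query (tool call)."""
    # In a real system this would call Knowledge Buddy's query API and
    # return grounded text with source attributions.
    return f"[retrieved knowledge for: {question}]"

TOOLS = {"knowledge_search": ask_knowledge_buddy}

def agent_step(llm_decision: dict, working_memory: list[str]) -> None:
    """One iteration of the agent loop: act on the LLM's chosen tool."""
    if llm_decision["tool"] in TOOLS:
        result = TOOLS[llm_decision["tool"]](llm_decision["input"])
        # Only the retrieved snippet enters the context window --
        # the knowledge base itself stays outside the prompt.
        working_memory.append(result)

memory: list[str] = []
agent_step({"tool": "knowledge_search", "input": "escalation policy"}, memory)
print(len(memory))  # → 1
```

The design choice to route all knowledge access through a single governed tool is what lets the agent stay stateless while the organisation retains control over what it can see.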


Great article to understand context. Considering Pega Agents can leverage Knowledge Buddies, it would be great to understand under what use cases it would be best to combine them and use a buddy as a tool for an agent.


Great articulation of why RAG is not a workaround, but a foundational architectural pattern for enterprise AI. The “lost in the middle” problem makes it clear that scale without retrieval discipline only amplifies risk, not intelligence. And for us, framing Knowledge Buddy as governed long‑term memory for agentic AI is especially powerful. Great article!


I have seen Agents going a tad screwy a lot, esp. the more the conversation runs on. The better my prompt at the beginning, the better the results.

What I am interested in is how do we (or is it the LLM?) determine the “end” of a conversation? If results are better at the end, how do we (or can we?!) influence it?

Maybe it’s like talking to a child - “This is your last warning, or you’re off to bed!”

Interested in people’s thoughts!


In light of Knowledge Buddy and agentic AI, RAG can also be used in a very instrumental way: updating agent instructions dynamically.
I have done something similar to fetch the agent’s cardinal directives directly from Pega KB using the KB tool in the agent, and performing tasks based on that.
This inherently makes the agent more flexible, with dynamic instructions to serve us better.