Be careful using data pages as Knowledge Sources for GenAI Agents

When building GenAI-powered agents in Pega, a common and intuitive approach is to expose enterprise data through a data page and use that as a knowledge source.

At first glance, this seems perfectly reasonable.

However, when working with large datasets (e.g., ~60,000 records) and asking agents to perform aggregations and summaries, this approach can produce:

  • Inconsistent results

  • Poor performance

  • Excessive token usage

In contrast, Pega GenAI “Chat With Your Data” takes a fundamentally different approach—one that is both more efficient and more reliable.

This post walks through a real example comparing both patterns.

:warning: Note: This testing was performed on Pega Infinity 25.1.2. Behavior and capabilities may evolve in future releases.

The Scenario

We use a dataset of approximately 60,000 customer records and ask a simple question:

“Show me a count of customers by state in the New England region, and break that down by city.”

This is a classic aggregation query—something a database is designed to handle efficiently.
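To make that concrete, here is a minimal sketch of what the same question looks like when a database answers it. It uses Python's built-in sqlite3 module as a stand-in. The customer table, its state and city columns, and the sample rows are assumptions made up for illustration; they are not the actual schema behind the Customer data type.

```python
import sqlite3

# Two-letter codes for the six New England states.
NEW_ENGLAND = ("CT", "MA", "ME", "NH", "RI", "VT")


def count_by_state_and_city(conn):
    """Return (state, city, customer_count) rows, aggregated by the database."""
    placeholders = ",".join("?" for _ in NEW_ENGLAND)
    sql = (
        "SELECT state, city, COUNT(*) AS customer_count "
        "FROM customer "  # hypothetical table name
        f"WHERE state IN ({placeholders}) "
        "GROUP BY state, city "
        "ORDER BY state, city"
    )
    return conn.execute(sql, NEW_ENGLAND).fetchall()


# Tiny made-up sample, only to show the shape of the result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, state TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer (state, city) VALUES (?, ?)",
    [("MA", "Boston"), ("MA", "Boston"), ("MA", "Worcester"),
     ("NH", "Nashua"), ("NY", "Albany")],
)
rows = count_by_state_and_city(conn)
# rows == [("MA", "Boston", 2), ("MA", "Worcester", 1), ("NH", "Nashua", 1)]
```

Only the grouped counts leave the database, so the size of the answer depends on the number of state and city combinations, not on the number of customer records.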

Approach 1: Data Page–Driven Agent

Agent Design

The agent is configured to:

  • Use a data page tool (customerdata)

  • Retrieve records and perform aggregation inside the LLM

You are an application-level assistant that helps users explore and understand customer data provided via a Data Page. Use the Data Page context to answer questions, summarize key customer facts (profile, status, recent activity, risks, and notable changes), and explain the “why” behind any insights using only the data you are given. If the user’s question can’t be answered from the available Data Page content, ask a short follow-up question describing exactly what data is missing, or suggest what additional source/tool (if available) would be needed—without guessing or fabricating details.

Use the customerdata tool to get a list of customers.

If you are being asked to get a count, summary, or aggregate value get the list of customers from the customerdata tool then calculate what you were asked for using the list that is returned by the tool.


Tool Invocation Requires Reinforcement

Getting the agent to call the tool reliably took extra configuration. The same test questions had to be added to both:

  • Quick Select prompts

  • Tool example phrases

Caption:
Tool usage had to be explicitly reinforced through both quick-select prompts and tool example phrases. Without this, the agent did not consistently invoke the data page.


Observed Behavior

1. Default Data Page Limitation

Caption:
Default data page limits restrict the number of rows returned, meaning aggregation is often performed on incomplete data.

:right_arrow: The agent is often not operating on the full dataset
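
The effect of a row limit on an aggregate can be sketched in a few lines of Python. The row_limit parameter and the sample rows below are hypothetical. They stand in for whatever limit the data page applies and do not reflect an actual Pega default.

```python
from collections import Counter


def count_by_state(rows, row_limit=None):
    """Count customers per state, optionally seeing only the first row_limit rows."""
    visible = rows if row_limit is None else rows[:row_limit]
    return Counter(state for state, _city in visible)


# Made-up data: 6 customers in MA followed by 4 in NH.
rows = [("MA", "Boston")] * 6 + [("NH", "Nashua")] * 4

full = count_by_state(rows)                    # Counter({'MA': 6, 'NH': 4})
truncated = count_by_state(rows, row_limit=5)  # Counter({'MA': 5})
```

With the limit applied, NH disappears entirely and MA is undercounted, yet nothing in the truncated result signals that rows are missing.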


2. Inconsistent Results Across Runs

The exact same question was asked multiple times against the same agent.

Repeated executions of the same query produced different results, including varying counts and distributions.

:right_arrow: Observed issues:

  • Totals differ between runs

  • State distributions are inconsistent

  • City-level breakdowns do not reconcile
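
A comparison like this can be automated rather than eyeballed. The following is a minimal sketch, assuming each run's answer has already been parsed into a dictionary of state counts. The run data shown is made up for illustration and is not the output captured in this test.

```python
def find_disagreements(runs):
    """Return the sorted keys whose values are not identical across all runs."""
    keys = set().union(*(run.keys() for run in runs))
    return sorted(k for k in keys if len({run.get(k) for run in runs}) > 1)


# Hypothetical parsed answers from three runs of the same question.
runs = [
    {"MA": 10, "NH": 4},
    {"MA": 12, "NH": 4},
    {"MA": 10, "NH": 4, "VT": 1},
]

disagreements = find_disagreements(runs)  # ['MA', 'VT']
```

An empty list means every run agreed. Any key in the list is a count that changed, or appeared in only some runs.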

Why this happens

The agent is following this pattern:

  1. Call the data page

  2. Retrieve a list of records (often incomplete)

  3. Perform aggregation inside the LLM

Each run can vary in:

  • Which subset of rows is returned

  • Whether and how the tool is invoked

  • How the LLM reasons over the rows

:right_arrow: Together, these lead to non-deterministic outputs

3. Token Inefficiency

Caption:
The data page approach generates extremely high token usage (~120K tokens) due to retrieving large record sets and reprocessing them in the LLM.

:right_arrow: The LLM is effectively doing the job of a database query engine
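
A back-of-the-envelope sketch shows why the payload dominates. It assumes roughly 4 characters per token, which is a common rule of thumb and not a real tokenizer, and it uses made-up records. It is not meant to reproduce the ~120K figure above, only the difference in scale.

```python
import json

CHARS_PER_TOKEN = 4  # rough rule of thumb (assumption), not a real tokenizer


def estimate_tokens(payload):
    """Very rough token estimate for a JSON-serialized payload."""
    return len(json.dumps(payload)) // CHARS_PER_TOKEN


# Made-up records standing in for rows returned by a data page.
records = [{"id": i, "state": "MA", "city": "Boston"} for i in range(1000)]

# The same information after aggregation at the data layer.
summary = [{"state": "MA", "city": "Boston", "count": len(records)}]

raw_tokens = estimate_tokens(records)
summary_tokens = estimate_tokens(summary)
```

The raw payload grows with the number of rows, while the summary grows only with the number of groups.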


Root Cause

The issue is architectural:

Retrieve records → Then aggregate in the LLM

This results in:

  • Large payload transfers

  • High token consumption

  • Inconsistent outputs

  • Poor scalability

Approach 2: Chat With Your Data

Agent Design (Governed and Constrained)

You are a governed analytics assistant for demo customer data. Your responsibility is to use the “chat with your data” capability to produce summaries and aggregated query results for the Customer data type.

Hard scope restriction: You MUST ONLY access data from the class Demo-AIUseCases-Data-Customer. Do not access or reference any other class, case type, data type, external system, or knowledge source for customer record retrieval.

Default behavior: Prefer aggregated answers (counts, distributions, averages, min/max, percentiles, group-bys, top-N, trends) over returning raw records. If the user asks for “all customers,” “export,” “full list,” or any unbounded record dump, redirect them to request a summarized or aggregated query instead.

Record volume rule: If fulfilling a request would return more than 30 records, you MUST warn the user before returning results and propose a summarized alternative.

Explainability: Clearly state (a) what filters you applied, (b) what aggregation you performed, and (c) what the results mean in plain language.

Context discipline: Keep responses concise and avoid returning large text or large tables.


Important: Constraining the Data Scope

One key difference with Chat With Your Data is that the agent will attempt to determine which data class to query unless explicitly constrained.

Example Risk

If your system contains:

  • SMB-Customer

  • Consumer-Customer

An unconstrained agent may:

  • Select the wrong data class

  • Misinterpret the scope of the question

:right_arrow: Leading to incorrect aggregates and misleading results


Why This Matters

Unlike data pages, which are fixed, Chat With Your Data is flexible by design, and that flexibility requires governance.

That’s why this instruction is critical:

Hard scope restriction: You MUST ONLY access data from the class Demo-AIUseCases-Data-Customer.


Best Practice

  • Always explicitly constrain the data class

  • Do not rely on naming or inference

  • Treat data source selection as a governance decision


Observed Behavior

1. Consistent, Accurate Results

Caption:
Chat With Your Data produces consistent, fully reconciled results with clear aggregation logic and structured output.

Example results:

  • Total customers: 6,517

  • CT: 288

  • MA: 5,420

  • NH: 527

  • VT: 282

:right_arrow: City totals correctly roll up to state totals
:right_arrow: Results remain consistent across executions
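
The roll-up can be verified mechanically. This minimal sketch checks that the state counts listed above add up to the reported total. The same check could be applied to city counts within each state, but those numbers are not listed here, so they are left out.

```python
# State counts and total as reported in the example results above.
state_counts = {"CT": 288, "MA": 5420, "NH": 527, "VT": 282}
reported_total = 6517


def reconciles(parts, total):
    """True when the parts add up exactly to the reported total."""
    return sum(parts.values()) == total


is_reconciled = reconciles(state_counts, reported_total)  # True
```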


2. Structured and Explainable Output

The response includes:

  • Filters applied

  • Aggregation performed

  • Structured summaries

  • Clear explanation of results

:right_arrow: Outputs were consistent across runs and explainable


3. Token Efficiency

Caption:
Aggregation is executed at the data layer, dramatically reducing token usage and eliminating large payload transfers.

:right_arrow: No full dataset retrieval
:right_arrow: No token spikes
:right_arrow: More efficient execution

Why This Works Better

The difference is architectural:

Data Page Pattern

Retrieve records → Aggregate in the LLM

Chat With Your Data Pattern

Generate query → Aggregate in the database


Key Takeaways

:white_check_mark: Data Pages Work Well For

  • Record-level access

  • Small datasets

  • Customer detail exploration


:warning: Use Caution with Data Pages For

  • Large datasets

  • Aggregation queries

  • Analytical use cases


:white_check_mark: Chat With Your Data Is Ideal For

  • Aggregations and summaries

  • Large datasets

  • Consistent, explainable outputs

  • Efficient execution


Final Thought

This was not a misconfiguration.

  • The agent was clearly instructed

  • Tool usage was reinforced

  • Example phrases were aligned

Yet the limitations persisted.

Because the issue is not configuration—it is architecture.

If you ask an LLM to do a database’s job, you will pay for it in tokens, performance, and correctness.


Recommendation

When building GenAI-driven analytics in Pega:

  • Be intentional about where computation happens

  • Prefer database-driven aggregation over LLM reasoning

  • Use Chat With Your Data for scale

  • Always apply strict data scoping guardrails

Love this, Joe. A perfect way to avoid using a probabilistic model for a deterministic task. This way we achieve separation of concerns (SoC) between the LLM and the database and avoid committing an anti-pattern.