In a Pega-based application that leverages generative AI or external LLM integrations, are there recommended approaches to cache prompts and responses to reduce token consumption and associated costs? Specifically, I’m looking for strategies such as reusing previously generated outputs, implementing caching layers within Pega (e.g., data pages, decision strategies, or case-level storage), or optimizing prompt construction to avoid redundant calls.
Additionally, are there best practices for designing such solutions in Pega—like using keyed data pages for prompt-response caching, leveraging embeddings for similarity-based retrieval, or applying memoization techniques—to improve performance and cost efficiency in production environments?
Interesting question. I would break this down into two parts:
To improve performance
- Ensure agents have access only to the tools they need, so they don't waste time evaluating tools of little relevance.
- Tune each tool to be performance-efficient, since long-running tools can degrade overall response time.
- Leverage node-level data pages inside tools to take advantage of Pega's OOTB caching.
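To illustrate the caching idea behind a node-level data page in a runnable form, here is a minimal Python sketch using memoization. The function name and return value are made up for illustration; in Pega this role is played by a node-level data page keyed by its parameters, not by application code:

```python
import functools

@functools.lru_cache(maxsize=256)
def lookup_account_summary(account_id: str) -> str:
    # Stand-in for an expensive backend call a tool might make.
    # Repeat calls with the same key are served from the cache,
    # analogous to a node-level data page hit.
    return f"summary-for-{account_id}"

lookup_account_summary("A-100")  # first call: computed and cached
lookup_account_summary("A-100")  # second call: served from cache
```

After the two calls, `lookup_account_summary.cache_info()` shows one miss and one hit, which is exactly the behavior you want from any tool whose inputs repeat across agent turns.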
Caching
Pega provides OOTB semantic caching for agents, so they don't have to reason from scratch for every prompt. I don't think there is any OOTB configuration for us to achieve memoization; it's all built in.
There is also built-in context caching / prompt caching.
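For readers unfamiliar with the general pattern, here is a toy sketch of how a semantic cache can work. This is not Pega's internal implementation (that question is answered below); it is a generic illustration that uses a bag-of-words "embedding" and cosine similarity in place of a real embedding model, with a threshold deciding when a stored response may be reused:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts. A real semantic cache
    # would use a learned embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached response)

    def get(self, prompt: str):
        qe = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(qe, e[0]), default=None)
        if best and cosine(qe, best[0]) >= self.threshold:
            return best[1]  # similar enough: reuse the stored response
        return None         # cache miss: call the LLM and put() the result

    def put(self, prompt: str, response: str):
        self.entries.append((embed(prompt), response))

cache = SemanticCache(threshold=0.8)
cache.put("what is the order status for case C-123", "Order C-123 shipped.")
# A near-duplicate prompt clears the threshold and reuses the response.
hit = cache.get("what is the order status for case C-123 please")
```

The threshold is the key tuning knob: too low and unrelated prompts get stale answers, too high and near-duplicates still burn tokens.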
What underlying mechanism is used to determine semantic similarity between prompts?
Does it rely on embedding-based similarity (e.g., vector representations), or does it use alternative approaches such as heuristic matching, prompt normalization, or rule-based strategies?
Is there any persistent storage of past prompts and responses involved in this process?
For example, are prompts stored (potentially as embeddings) and matched against incoming prompts to enable reuse of prior responses?
What is the scope of caching?
Is it limited to session/context-level reuse, or does it operate across sessions/users?
When a similar prompt is identified, does the system directly reuse a cached response, or does it augment it with additional reasoning based on the new input?
Are there any observability tools or documentation (e.g., whitepapers) that provide insight into, or control over, this caching behavior?
Any detailed documentation or architectural guidance on this topic would be very helpful.
I was able to find some information on the underlying ML implementation, but I will defer to the Pega engineering team to answer that more accurately. On observability: AI tracers, Autopilot conversation cases, and the PEGA00xx series of alerts might be helpful.
One approach I use to ensure the LLM only receives the exact case data it needs is to deliberately carve that data out using the out‑of‑the‑box case data page.
For my knowledge tool, I create a data page of the same class and use a data transform as its source. In that data transform, I reference the OOTB case data page, passing in the pyID, and copy the results into a page declared in Pages & Classes. I then explicitly map only the required properties onto my primary page, which is the tool's data page.
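The mapping step in that data transform amounts to whitelisting properties. A minimal Python sketch of the same idea, with hypothetical field names standing in for case properties:

```python
# Full case record, roughly as the OOTB case data page might return it
# (all field names here are hypothetical).
full_case = {
    "pyID": "C-123",
    "CustomerName": "Ada",
    "OrderTotal": 42.5,
    "AuditHistory": ["..."],  # large and irrelevant to the tool
    "InternalNotes": "...",   # should never reach the LLM
}

# Whitelist only what the tool needs -- the analogue of mapping
# individual properties onto the tool data page.
REQUIRED = ("pyID", "CustomerName", "OrderTotal")
tool_payload = {k: full_case[k] for k in REQUIRED}
```

A whitelist is safer than a blacklist here: new properties added to the case later stay out of the prompt by default instead of silently leaking in.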
While LLMs are generally good at navigating page structures, this approach improves both performance and accuracy by ensuring the model receives only the data it actually needs, no more and no less. Any other approaches out there?
Use data transforms for token efficiency with Agents and Coach: apply data transforms to supply only the essential information from the data page. This reduces the amount of data the large language model (LLM) processes, improving token efficiency and response speed. Sample token usage when conversing with a Coach:
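As a back-of-the-envelope way to see the savings from trimming the payload, here is a rough estimate using the common heuristic of roughly four characters per token for English JSON (the heuristic and field names are assumptions, not measured Pega figures):

```python
import json

def rough_tokens(payload: dict) -> int:
    # Crude heuristic: ~4 characters per token for English JSON text.
    return len(json.dumps(payload)) // 4

# Hypothetical untrimmed vs. trimmed payloads sent to the LLM.
full = {"pyID": "C-123", "CustomerName": "Ada", "History": "x" * 400}
trimmed = {"pyID": "C-123", "CustomerName": "Ada"}

saved = rough_tokens(full) - rough_tokens(trimmed)
```

Even this toy example saves on the order of a hundred tokens per call; multiplied across every turn of every conversation, trimming adds up quickly.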