In a Pega-based application that leverages generative AI or external LLM integrations, are there recommended approaches to cache prompts and responses to reduce token consumption and associated costs? Specifically, I’m looking for strategies such as reusing previously generated outputs, implementing caching layers within Pega (e.g., data pages, decision strategies, or case-level storage), or optimizing prompt construction to avoid redundant calls.
Additionally, are there best practices for designing such solutions in Pega—like using keyed data pages for prompt-response caching, leveraging embeddings for similarity-based retrieval, or applying memoization techniques—to improve performance and cost efficiency in production environments?
Interesting question. I would break this down into two parts:
To improve performance
- Ensure agents have access only to the tools they need, so they don't waste time evaluating tools of little relevance.
- Tune each tool to be performance-efficient, since long-running tools can degrade overall response time.
- Leverage node-level data pages inside tools to take advantage of Pega's OOTB caching.
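To illustrate the caching idea behind a node-level data page in a runnable form, here is a minimal Python sketch using memoization. The function name and return value are made up for illustration; in Pega this role is played by a node-level data page keyed by its parameters, not by application code:

```python
import functools

@functools.lru_cache(maxsize=256)
def lookup_account_summary(account_id: str) -> str:
    # Stand-in for an expensive backend call a tool might make.
    # Repeat calls with the same key are served from the cache,
    # analogous to a node-level data page hit.
    return f"summary-for-{account_id}"

lookup_account_summary("A-100")  # first call: computed and cached
lookup_account_summary("A-100")  # second call: served from cache
```

After the two calls, `lookup_account_summary.cache_info()` shows one miss and one hit, which is exactly the behavior you want from any tool whose inputs repeat across agent turns.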
Caching
Pega provides OOTB semantic caching for agents, so they don't have to reason from scratch for every prompt. I don't think there is any OOTB configuration for us to achieve memoization; it's all built in.
There is also built-in context caching / prompt caching.
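For readers unfamiliar with the general pattern, here is a toy sketch of how a semantic cache can work. This is not Pega's internal implementation (that question is answered below); it is a generic illustration that uses a bag-of-words "embedding" and cosine similarity in place of a real embedding model, with a threshold deciding when a stored response may be reused:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts. A real semantic cache
    # would use a learned embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached response)

    def get(self, prompt: str):
        qe = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(qe, e[0]), default=None)
        if best and cosine(qe, best[0]) >= self.threshold:
            return best[1]  # similar enough: reuse the stored response
        return None         # cache miss: call the LLM and put() the result

    def put(self, prompt: str, response: str):
        self.entries.append((embed(prompt), response))

cache = SemanticCache(threshold=0.8)
cache.put("what is the order status for case C-123", "Order C-123 shipped.")
# A near-duplicate prompt clears the threshold and reuses the response.
hit = cache.get("what is the order status for case C-123 please")
```

The threshold is the key tuning knob: too low and unrelated prompts get stale answers, too high and near-duplicates still burn tokens.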
What underlying mechanism is used to determine semantic similarity between prompts?
Does it rely on embedding-based similarity (e.g., vector representations), or does it use alternative approaches such as heuristic matching, prompt normalization, or rule-based strategies?
Is there any persistent storage of past prompts and responses involved in this process?
For example, are prompts stored (potentially as embeddings) and matched against incoming prompts to enable reuse of prior responses?
What is the scope of caching?
Is it limited to session/context-level reuse, or does it operate across sessions/users?
When a similar prompt is identified, does the system directly reuse a cached response, or does it augment it with additional reasoning based on the new input?
Are there any observability tools or documentation (e.g., whitepapers) that provide insight into, or control over, this caching behavior?
Any detailed documentation or architectural guidance on this topic would be very helpful.
I was able to find some information on the underlying ML implementation, but I will defer to the Pega engineering team to answer that more accurately. On observability: AI tracers, Autopilot conversation cases, and the PEGA00xx series of alerts might be helpful.
One approach I use to ensure the LLM only receives the exact case data it needs is to deliberately carve that data out using the out‑of‑the‑box case data page.
For my knowledge tool, I create a data page of the same class and use a data transform as its source. In that data transform, I reference the OOTB case data page, passing in the pyID, and copy the results into a page declared in Pages & Classes. I then explicitly map only the required properties onto my primary page, which is the tool's data page.
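The mapping step in that data transform amounts to whitelisting properties. A minimal Python sketch of the same idea, with hypothetical field names standing in for case properties:

```python
# Full case record, roughly as the OOTB case data page might return it
# (all field names here are hypothetical).
full_case = {
    "pyID": "C-123",
    "CustomerName": "Ada",
    "OrderTotal": 42.5,
    "AuditHistory": ["..."],  # large and irrelevant to the tool
    "InternalNotes": "...",   # should never reach the LLM
}

# Whitelist only what the tool needs -- the analogue of mapping
# individual properties onto the tool data page.
REQUIRED = ("pyID", "CustomerName", "OrderTotal")
tool_payload = {k: full_case[k] for k in REQUIRED}
```

A whitelist is safer than a blacklist here: new properties added to the case later stay out of the prompt by default instead of silently leaking in.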
While LLMs are generally good at navigating page structures, this approach improves both performance and accuracy by ensuring the model receives only the data it actually needs, no more and no less. Any other approaches out there?
Use data transforms for token efficiency with Agents and Coach: apply data transforms to supply only the essential information from the data page. This reduces the amount of data the large language model (LLM) processes, improving token efficiency and response speed. Sample token usage when conversing with a Coach:
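As a back-of-the-envelope way to see the savings from trimming the payload, here is a rough estimate using the common heuristic of roughly four characters per token for English JSON (the heuristic and field names are assumptions, not measured Pega figures):

```python
import json

def rough_tokens(payload: dict) -> int:
    # Crude heuristic: ~4 characters per token for English JSON text.
    return len(json.dumps(payload)) // 4

# Hypothetical untrimmed vs. trimmed payloads sent to the LLM.
full = {"pyID": "C-123", "CustomerName": "Ada", "History": "x" * 400}
trimmed = {"pyID": "C-123", "CustomerName": "Ada"}

saved = rough_tokens(full) - rough_tokens(trimmed)
```

Even this toy example saves on the order of a hundred tokens per call; multiplied across every turn of every conversation, trimming adds up quickly.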