Non-Prod environments are frequently going down with high heap usage, both with and without traffic.
Please find the attached doc for details.
NonProd environment issue.docx (85.7 KB)
If you still need help, please log an INC via the MSP and provide the INC reference here.
This is a GenAI-powered tool. All generated answers require validation against the provided references.
Based on the information you shared, the primary issue appears to be related to the high number of batch requestors that remain in memory even when there’s no traffic. The data you provided shows 3,561 total requestors with 2,240 batch requestors, but only 35 active requestors. This significant discrepancy is a clear indicator of the problem.
In Pega environments, batch requestors created by background processes like Queue Processors, Agents, and Job Schedulers are not automatically cleared from memory. Instead, they remain in memory and are reused until the node is restarted. This explains why your heap usage remains consistently high (>80%) even without traffic, and why it quickly maxes out when traffic is introduced.
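To confirm this pattern on a node, you can sample old-generation occupancy while the environment is idle. A minimal sketch using the standard JDK `jstat` tool (the PID and the 10-second interval are placeholders; substitute your Pega JVM's process ID):

```shell
# Sample GC utilization every 10 seconds for the Pega JVM (replace <pid>).
# Column "O" is old-gen occupancy as a percent of capacity; if it stays
# above ~80% with no traffic, live objects (such as retained requestors)
# are pinning the heap rather than transient request load.
jstat -gcutil <pid> 10000
```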
Here are my recommendations to address this issue:
The most direct solution is to implement proper cleanup of orphaned batch requestors. Recent Pega Platform updates (specifically 24.2.1) include fixes for this exact issue. According to Pega support documentation: “An excessive number of batch requestors were causing performance issues and exceeding the node limit. Investigation showed the requestors were not being properly cleared, and an update has been added to return orphan batch requestors.”
If you’re not on this version, consider upgrading, or requesting an equivalent fix as a hotfix.
Perform a detailed analysis of your requestor count and implement proper requestor management: review which Queue Processors, Agents, and Job Schedulers are spawning batch requestors, and whether their counts are appropriate for a non-production environment.
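One low-level way to gauge how many requestor-related objects are resident on the heap is a live-object class histogram from the standard JDK `jmap` tool (the PID and the `requestor` match string are assumptions; inspect the unfiltered histogram first to find the exact class names on your platform):

```shell
# Histogram of live objects, filtered for requestor-related classes.
# Note: -histo:live forces a full GC before counting, so run this in a
# non-prod window. Compare counts at idle vs. under traffic.
jmap -histo:live <pid> | grep -i requestor
```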
While you’ve already increased the JVM heap from 8 GB to 15 GB, this treats the symptom rather than the cause. You can, however, also tune garbage collection.
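If you stay on the larger 15 GB heap, G1 with an explicit pause target is a common starting point. These flags are generic HotSpot tuning, not a Pega-specific prescription; the values and the `JAVA_OPTS` variable name are assumptions to validate against your application server's startup configuration:

```shell
# Illustrative JVM options (values are assumptions; tune per workload):
JAVA_OPTS="$JAVA_OPTS -Xms15g -Xmx15g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/log/pega/heapdumps"
```

Setting `-Xms` equal to `-Xmx` avoids resize pauses, and the heap-dump flags ensure that if a node does go down with an OutOfMemoryError, you capture evidence of what was retaining memory.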
As a stopgap while you work on the solutions above, consider scheduling node restarts during low-traffic periods. Since batch requestors are cleared only when the node restarts, this can provide temporary relief.
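One way to schedule such a restart is a cron entry during a low-traffic window. The script path, log path, and time below are hypothetical placeholders for whatever restart mechanism your environment uses:

```shell
# Restart the Pega node daily at 02:30 (hypothetical script and paths).
30 2 * * * /opt/pega/bin/restart-node.sh >> /var/log/pega/restart.log 2>&1
```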
You mentioned cleaning up custom tables that were taking longer to respond; continue to monitor their growth and response times.
After implementing these changes, closely monitor heap usage, garbage-collection frequency, and the batch requestor count, both under traffic and at idle.
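The heap-usage check can be automated with a small script. This is a hypothetical monitoring sketch, not a Pega utility; the function name, the 80% default threshold, and the `jstat`-based sampling shown in the comment are all assumptions to adapt:

```shell
#!/bin/sh
# Hypothetical sketch: compare old-gen occupancy against an alert threshold.
check_heap() {
  usage=$1            # old-gen occupancy percent (integer), e.g. obtained via:
                      #   jstat -gcutil "$PID" | awk 'NR==2 {print int($4)}'
  threshold=${2:-80}  # alert threshold, defaulting to 80%
  if [ "$usage" -gt "$threshold" ]; then
    echo "ALERT: old-gen at ${usage}% (threshold ${threshold}%)"
  else
    echo "OK: old-gen at ${usage}%"
  fi
}

check_heap 85   # above the 80% default, so this raises an alert
check_heap 45   # comfortably below the threshold
```

Wiring this into cron alongside the restart job would give you a simple record of whether heap pressure is actually dropping after the requestor cleanup.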