Problem statement
As a system architect, you are often asked about system health, especially when users observe performance degradation or other issues. This appears to be a simple request, but it can be problematic when the affected system is the production one and you typically do not have production access. This is where PDC comes to the rescue. It provides a lot of information about the operational state of the system while keeping sensitive production data isolated.
Home screen
The first indicator of potential issues can already be seen on the PDC home screen.
Here you can see event statistics for each monitored environment. The numbers of events received and active cases can be interesting when tracked over a longer period of time, but as raw data they are not easy to interpret: high numbers on a heavily used environment do not necessarily indicate a potential issue.
On the other hand, any number of throttled or urgent cases is worrying and should be investigated.
Throttled cases are cases for which the number of alert messages with the same ID exceeds the defined threshold, so the system temporarily suppresses further alerts with that ID. This could happen, for instance, when an external service frequently called by a Pega application suddenly becomes overloaded and its response time increases, which results in an exceptionally high number of events received. When this happens, PDC can automatically increase the default threshold for this kind of event so that it is not flooded with such events.
PDC designates certain alerts as urgent events that require immediate attention, for example when agents become disabled or an application crashes. These cases need to be investigated, because a failed background task or an application crash could have resulted in data loss and remediation might be required.
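The general idea of suppressing repeated alerts once a threshold is exceeded can be illustrated with a minimal sketch. This is not PDC's actual implementation: the class `AlertThrottle`, its parameters, the fixed time window and the alert ID `EXAMPLE-ALERT` are all hypothetical and serve only to show the principle.

```python
from collections import defaultdict


class AlertThrottle:
    """Toy per-alert-ID throttle (hypothetical, not PDC's real logic)."""

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold            # max alerts processed per window
        self.window_seconds = window_seconds  # length of one counting window
        self._window_start = {}               # alert ID -> start of current window
        self._count = defaultdict(int)        # alert ID -> alerts seen in window

    def accept(self, alert_id: str, timestamp: float) -> bool:
        """Return True if the alert is processed, False if it is suppressed."""
        start = self._window_start.get(alert_id)
        if start is None or timestamp - start >= self.window_seconds:
            # Start a fresh counting window for this alert ID.
            self._window_start[alert_id] = timestamp
            self._count[alert_id] = 0
        self._count[alert_id] += 1
        return self._count[alert_id] <= self.threshold


throttle = AlertThrottle(threshold=3, window_seconds=60)
# Five alerts with the same ID within one minute, then one after the window ends.
decisions = [throttle.accept("EXAMPLE-ALERT", t) for t in (0, 1, 2, 3, 4, 61)]
print(decisions)
```

In this sketch the fourth and fifth alerts are suppressed, and processing resumes once a new window starts. A slow external service would produce exactly this kind of burst of identical alerts.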
Case Explorer
The Case Explorer landing page displays all cases for which an event was received in the last seven days. They can be filtered by type, impact and date.
To see cases for throttled events, filter Status by Open-ForcedElevatedKPI. To see urgent cases, apply a filter on the Is urgent column. After applying the desired filters, you can analyse the details of problematic cases. To see all available information about an individual case, for instance one from the urgent category, open the case by clicking the link in the Case ID column.
Case view
After opening a case, on the Overview tab you can see detailed information about the event that triggered its creation, the list of Pega applications for which the event was recorded, whether it affects end users, background processing or services, and occurrence trends. Additionally, you can see whether a case for the same event exists in other systems. This lets you verify, for instance, whether a SQL query is underperforming only in production or in lower environments as well.
The Origin tab provides additional information that helps you understand the alert, for instance the name of the rule causing the issue or an exception stack trace. Depending on the event type, additional tabs may be available. For instance, database query related events have an SQL Analysis tab that allows you to analyse the problematic SQL query or event.
The Event Distribution tab displays detailed trend graphs for the event together with a list of recent Pega application and platform updates, allowing you to instantly spot possible correlations.
The Events tab lists all recent events that follow the same event pattern which led to the creation of this case. You can use it to compare individual events for additional patterns or differences.
The Insights tab displays a detailed statistical analysis of events and their distribution according to the event KPI. Please note that these statistics are generated from events received by PDC, so only events which have exceeded the configured threshold are taken into account. For this reason, this data cannot be used to assess general performance, for instance of a service call, because all calls that completed within the threshold are not included.
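A small worked example shows why statistics built only from threshold-exceeding events say little about overall performance. The threshold and the response times below are invented for illustration and do not come from any real system.

```python
from statistics import mean

THRESHOLD_MS = 1000  # hypothetical alert threshold for a service call

# Hypothetical response times (ms) of ten calls; only two exceed the threshold.
all_calls_ms = [120, 150, 180, 200, 220, 250, 300, 400, 1500, 2500]

# Only calls above the threshold raise an alert, so only these reach monitoring.
reported_ms = [t for t in all_calls_ms if t > THRESHOLD_MS]

true_mean = mean(all_calls_ms)      # what users experience on average
reported_mean = mean(reported_ms)   # what alert-based statistics would show
share_reported = len(reported_ms) / len(all_calls_ms)

print(f"mean over all calls:      {true_mean} ms")
print(f"mean over reported calls: {reported_mean} ms")
print(f"share of calls reported:  {share_reported:.0%}")
```

The mean over the reported calls is far higher than the mean over all calls, because the fast majority never produced an event. The alert-based figures describe how bad the slow calls are, not how the service performs in general.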
PDC Scores
For production-level systems, the PDC Scores landing page attempts to assess the general state of a system based on the most important factors.
Three tabs are available. Each one calculates a separate score for its category and lists recommended remediation actions:
- PDC Process Health assesses how actively PDC is used for monitoring and resolving system issues. A low score here usually means that not enough attention is paid to actively monitoring system health in PDC.
- Reliability assesses system stability based on the state of important system services. It also calculates how many cases in the production system could have been avoided by addressing them in lower environments before promoting code to production.
- Performance assesses browser and service response time trends as well as the database score. A low score can indicate performance issues noticeable by end users, which can lead to low satisfaction.
System Resources
Another good starting point when investigating system issues, or simply monitoring system health, is the set of landing pages grouped under the System Resources category.
The following landing pages are available:
- Database displays statistics for database size and its growth trend for up to the last 12 months. This is useful when unexpected or undesirable growth of database size occurs. The Tables tab allows you to dig into statistics at the level of an individual table, including its size, row count and number of performed operations. Additionally, you can check index information and usage statistics. This is invaluable when investigating slow database queries. Query Stats allows you to identify the globally most time-consuming or most frequent queries.
- Nodes lists currently active system nodes together with their status and basic information.
- Resource Utilization displays graphs of CPU and heap utilisation as well as the number of requestors, for all nodes of a specific type or for an individual node. When you know a system issue occurred at a specific time, you can use this page to observe any abnormalities related to the event.
- Search Information displays search service state and index statistics.
- DSS List displays Dynamic System Setting instances together with their values and update times, allowing you to track and assess changes to them.
- JVM Monitoring displays graphs related to JVM memory usage, for instance garbage collection runs, and is useful when investigating memory related issues such as out of memory errors.
Background Processing
The Background Processing landing page displays information about the state of different types of background tasks.
Information about Agents, Agent queues, Job Schedulers, Listeners and Queue Processors is available. It is similar to what can be found in Admin Studio in Pega, with the added advantage that PDC also collects historical data. This allows you, for instance, to identify the time of day when a Queue Processor receives the most items to process or when most of them end up in the broken queue, or to monitor the state and performance of a Job Scheduler. This is extremely useful, especially in production systems where direct access to Pega is very limited.
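The kind of time-of-day analysis that historical data makes possible can be sketched in a few lines. The timestamps below are invented, and the sketch assumes you have them available in ISO format from some source; it does not use or imply any PDC export or API.

```python
from collections import Counter
from datetime import datetime

# Hypothetical enqueue timestamps for one Queue Processor (illustrative data only).
enqueued_at = [
    "2024-03-04T08:15:00",
    "2024-03-04T09:05:00",
    "2024-03-04T09:40:00",
    "2024-03-05T09:10:00",
    "2024-03-05T14:30:00",
    "2024-03-06T09:55:00",
]

# Count items per hour of day, across all days in the sample.
items_per_hour = Counter(datetime.fromisoformat(ts).hour for ts in enqueued_at)

# The hour with the highest count is the daily peak for this queue.
busiest_hour, item_count = items_per_hour.most_common(1)[0]
print(f"Busiest hour: {busiest_hour}:00 with {item_count} items")
```

Grouping by hour of day rather than by calendar date is what reveals a recurring daily peak, which is only visible when data from several days is retained.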





