Pega 8.6 multinode deployment can take an hour to recover from a node failure during a campaign run

We are running a non-standard two-node universal deployment, and while we understand this is not recommended, we are investigating node recovery behavior in failure scenarios.

During our tests, we simulated a failure by deleting a single node while an active campaign was running. As part of the restoration process, we restored the embedded Kafka and Cassandra data before restarting the node. However, we observed inconsistent rejoin times:

  • In some cases, the node rejoined the cluster within our usual startup time of approximately 15 minutes.
  • In other cases, it took up to an hour for the node to rejoin.
  • On one occasion, the remaining active node also crashed, leaving both nodes offline for 15 minutes. Eventually, both nodes came back online and rejoined the cluster.

We would appreciate any insights into why a node might fail to rejoin promptly or destabilize the remaining node. I can provide relevant logs for further analysis; please let me know which specific logs would be most helpful.

Thanks for your time and support.

@SeanR17369101

:warning: This is a GenAI-powered answer. All generated answers require validation against the provided references.

Based on your scenario, I can recommend specific logs that would be most valuable for diagnosing these issues.

For a comprehensive analysis of node rejoining problems in your Pega 8.6 deployment with embedded Kafka and Cassandra, please collect the following logs from both nodes:

  1. Cassandra Logs - Found under the work/Catalina/localhost/prweb/ directory (files matching Cassandra*log*)
    These logs are critical for identifying errors and warnings related to the embedded Cassandra service, which could explain the delayed rejoin times. Pay special attention to any replication issues, seed node communication problems, or resource constraints.
  2. PegaRULES Log - Primary application log
    This contains debugging information about system errors and exceptions that may be affecting node recovery and performance during your campaign runs. Look for any Kafka or Cassandra connectivity issues.
  3. PegaALERT Log - Performance and threshold alerts
    This log captures diagnostic messages for failures and system events that exceed performance thresholds, which can help identify bottlenecks causing slow rejoin times.
  4. PegaCLUSTER Log - Cluster management information
    This provides crucial information about the setup and runtime behavior of your cluster, which is essential for understanding node recovery dynamics and communication issues between nodes.
  5. Kafka Server Logs - For embedded Kafka instances
    Since your scenario involves campaign runs which heavily rely on Kafka for message processing, these logs could reveal issues with Kafka topic replication, partition management, or consumer group rebalancing that might be contributing to the instability.
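As a rough illustration, sweeping that log set for errors around an incident window could be scripted along these lines. This is a minimal sketch, not a Pega-supplied tool: the Tomcat-style directory layout, the log file names, and the timestamp prefix are assumptions you would adjust to your installation; the demo log it creates is throwaway data so the sketch runs as-is.

```shell
# Hedged sketch: collect ERROR/WARN lines from the Pega log set inside a
# given timestamp window. Directory layout and file names are assumptions.
pega_incident_scan() {
  log_dir="$1"; window="$2"
  for name in PegaRULES PegaCLUSTER PegaALERT; do
    for log in "$log_dir"/"$name"*.log*; do
      [ -f "$log" ] || continue
      # keep only lines inside the window that carry ERROR or WARN
      grep -h "$window" "$log" | grep -E "ERROR|WARN"
    done
  done
  true
}

# Demo on a throwaway log so the sketch is runnable as-is;
# the timestamp and message below are hypothetical.
demo=$(mktemp -d)
printf '%s\n' \
  '2024-01-15 10:02:11 ERROR ClusterService - node rejoin timed out' \
  '2024-01-15 10:05:40 INFO  routine message' \
  > "$demo/PegaRULES.log"
pega_incident_scan "$demo" "2024-01-15 10:"
```

Running the same function against the real log directory on both nodes, with the window set to each slow-rejoin period, gives a compact error timeline to compare across nodes.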

For the most effective analysis, please collect these logs from:

  • The node that was deleted and later rejoined (focusing on the periods of slow rejoining)
  • The remaining active node (especially in the case where it also crashed)
  • Any timestamps corresponding to when both nodes were offline for 15 minutes

Additionally, if you have access to JVM monitoring data or heap dumps from when the issues occurred, these could also provide valuable insights into potential resource constraints affecting the recovery process.
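If capturing that JVM-side evidence is an option, a lightweight approach is the standard JDK tools rather than full heap dumps. The sketch below is an assumption-laden starting point: jps, jstat, and jcmd ship with the JDK, but the main-class match for the Pega/Tomcat JVM varies with how it is launched, and GC.heap_info needs a reasonably recent JDK; the guard skips any tool missing from PATH so nothing here is destructive.

```shell
# Hedged sketch: sample GC pressure and a heap summary from the running
# Pega JVM. The Bootstrap/catalina match is an assumption about how the
# JVM was started -- adjust it for your container.
probe() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@" || echo "probe failed: $1"
  else
    echo "skipped: $1 not on PATH"
  fi
}

PID=$(jps -l 2>/dev/null | awk '/Bootstrap|catalina/ {print $1; exit}')
if [ -n "$PID" ]; then
  probe jstat -gcutil "$PID" 5000 12   # GC occupancy, every 5 s for a minute
  probe jcmd "$PID" GC.heap_info       # heap summary without a full dump
else
  echo "no Pega/Tomcat JVM found via jps"
fi
```

Sampling during a slow rejoin and again during normal startup makes it easy to see whether the recovery window coincides with GC or heap pressure.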

That the remaining node sometimes crashes when the first node is deleted suggests resource contention or improper handling of partition rebalancing during campaign operations. The logs should help reveal whether this stems from Kafka topic replication, Cassandra data synchronization, or another underlying issue.
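Two quick health probes after a rejoin can narrow down which of those layers is lagging. nodetool is the standard Cassandra CLI and kafka-topics.sh the standard Kafka one; where they live inside a Pega embedded deployment varies, and localhost:9092 is an assumed broker address, so treat this as a sketch to adapt rather than a prescribed procedure. Each command is guarded so the snippet degrades gracefully where a tool is absent.

```shell
# Cassandra ring state: every node should report UN (Up/Normal).
if command -v nodetool >/dev/null 2>&1; then
  nodetool status || echo "nodetool ran but could not reach the cluster"
else
  echo "nodetool not on PATH"
fi

# Kafka partitions still missing replicas point at incomplete recovery;
# localhost:9092 is an assumed broker address -- adjust to your setup.
if command -v kafka-topics.sh >/dev/null 2>&1; then
  kafka-topics.sh --describe --under-replicated-partitions \
    --bootstrap-server localhost:9092 \
    || echo "kafka-topics.sh could not reach a broker"
else
  echo "kafka-topics.sh not on PATH"
fi
```

A node that shows DN in nodetool status, or a non-empty under-replicated-partitions listing, would point at Cassandra or Kafka respectively as the component still catching up during the slow rejoin.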

References:

  1. Troubleshooting Cassandra
  2. Recovering a node
  3. Monitoring Cassandra errors and warnings
  4. Requesting support services