Node initialization issue after upgrade to 24.1.2

The system was upgraded from 8.8.1 to 24.1.2. One of the web nodes fails to initialize and throws the error below in the logs. The other nodes show no errors, and the same cluster configuration (IP addresses and ports) worked fine in 8.8.1.

The cluster has 4 nodes:

  • Stream: 1
  • Backgroundprocess/search/BIX: 1
  • Web user: 2

I tried a setting in prconfig.xml on all nodes, and I also cleaned up the stream-related tables and the kafka-data folder on the stream node, but the issue persists. Are we missing something on this node? Please suggest.

2025-03-20 16:45:17,589 [ALIZE_STREAM_SERVICE] [ STANDARD] (.stream.StreamServiceValidator) ERROR - Failed to initialize StreamAPI: attempt 1 from 6. Next retry in 20 seconds
java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) ~[?:1.8.0_202]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) ~[?:1.8.0_202]
at com.pega.platform.stream.StreamServiceValidator.ping(StreamServiceValidator.java:91) ~[stream.jar:?]
at com.pega.platform.stream.StreamServiceValidator.validate(StreamServiceValidator.java:69) ~[stream.jar:?]
at com.pega.platform.stream.StreamService.getStreamAPI(StreamService.java:74) ~[stream.jar:?]
at com.pega.platform.modules.internal.guice.StreamServiceModule.createStreamAPI(StreamServiceModule.java:58) ~[modules-bridge.jar:?]
at com.google.inject.internal.ProviderInternalFactory.provision(ProviderInternalFactory.java:81) ~[guice-4.0.jar:?]
at com.google.inject.internal.InternalFactoryToInitializableAdapter.provision(InternalFactoryToInitializableAdapter.java:53) ~[guice-4.0.jar:?]
at com.google.inject.internal.ProviderInternalFactory.circularGet(ProviderInternalFactory.java:61) ~[guice-4.0.jar:?]
at com.google.inject.internal.InternalFactoryToInitializableAdapter.get(InternalFactoryToInitializableAdapter.java:45) ~[guice-4.0.jar:?]
at com.google.inject.internal.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:46) ~[guice-4.0.jar:?]
at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1103) ~[guice-4.0.jar:?]
at com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40) ~[guice-4.0.jar:?]
at com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:145) ~[guice-4.0.jar:?]
at com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:41) ~[guice-4.0.jar:?]
at com.google.inject.internal.InjectorImpl$2$1.call(InjectorImpl.java:1016) ~[guice-4.0.jar:?]
at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1092) ~[guice-4.0.jar:?]
at com.google.inject.internal.InjectorImpl$2.get(InjectorImpl.java:1012) ~[guice-4.0.jar:?]
at com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1051) ~[guice-4.0.jar:?]
at com.pega.platform.modules.internal.ModulesBridgeImpl.getStreamAPI(ModulesBridgeImpl.java:726) ~[modules-bridge.jar:?]
at com.pega.dsm.dnode.api.server.StreamServerService.getStreamAPI(StreamServerService.java:232) ~[d-node.jar:?]
at com.pega.dsm.dnode.api.server.StreamServiceInitializationTask.validateStreamAPI(StreamServiceInitializationTask.java:95) ~[d-node.jar:?]
at com.pega.dsm.dnode.api.server.StreamServiceInitializationTask.access$100(StreamServiceInitializationTask.java:32) ~[d-node.jar:?]
at com.pega.dsm.dnode.api.server.StreamServiceInitializationTask$1.run(StreamServiceInitializationTask.java:82) ~[d-node.jar:?]
at com.pega.dsm.dnode.api.server.StreamServiceInitializationTask$1.run(StreamServiceInitializationTask.java:77) ~[d-node.jar:?]
at com.pega.dsm.dnode.util.PrpcRunnable.execute(PrpcRunnable.java:77) ~[d-node.jar:?]
at com.pega.dsm.dnode.impl.prpc.service.ServiceHelper$2.run(ServiceHelper.java:295) ~[d-node.jar:?]
at com.pega.pegarules.session.internal.PRSessionProviderImpl.performTargetActionWithLock(PRSessionProviderImpl.java:1379) ~[prprivate-session.jar:?]
at com.pega.pegarules.session.internal.PRSessionProviderImpl.doWithRequestorLocked(PRSessionProviderImpl.java:1122) ~[prprivate-session.jar:?]
at com.pega.pegarules.session.internal.PRSessionProviderImpl.doWithRequestorLocked(PRSessionProviderImpl.java:929) ~[prprivate-session.jar:?]
at com.pega.dsm.dnode.impl.prpc.service.ServiceHelper.executeInPrpcContextInternal(ServiceHelper.java:288) ~[d-node.jar:?]
at com.pega.dsm.dnode.impl.prpc.service.ServiceHelper.executeInPrpcContext(ServiceHelper.java:150) ~[d-node.jar:?]
at com.pega.dsm.dnode.api.server.StreamServiceInitializationTask.initStreamServerServiceAndValidate(StreamServiceInitializationTask.java:77) ~[d-node.jar:?]
at com.pega.dsm.dnode.api.server.StreamServiceInitializationTask.runTask(StreamServiceInitializationTask.java:51) ~[d-node.jar:?]
at com.pega.pegarules.session.internal.async.startup.StartupTask$1.run(StartupTask.java:55) ~[prprivate-session.jar:?]
at com.pega.pegarules.session.internal.PRSessionProviderImpl.performTargetActionWithLock(PRSessionProviderImpl.java:1379) ~[prprivate-session.jar:?]
at com.pega.pegarules.session.internal.PRSessionProviderImpl.doWithRequestorLocked(PRSessionProviderImpl.java:1122) ~[prprivate-session.jar:?]
at com.pega.pegarules.session.internal.PRSessionProviderImpl.doWithRequestorLocked(PRSessionProviderImpl.java:929) ~[prprivate-session.jar:?]
at com.pega.pegarules.session.internal.PRSessionProviderImpl.doWithRequestorLocked(PRSessionProviderImpl.java:895) ~[

@Brahmesh Whenever you post a question on the PSC in parallel with opening a GCS ticket on MSP, could you mention that here?

I see that you have (correctly) opened an incident: INC-C13225 (Node fails to initialize after upgrade to 24.1.2).

We should let GCS investigate more closely.

Once it is resolved, please post the solution here as a reply, which you can then mark with Accept Solution.

Please also go through the troubleshooting tips given by our GenAI Assistant:

:warning: This is a GenAI-powered tool. All generated answers require validation against the provided references.

After analyzing the error logs, I’ve identified several potential solutions for resolving this issue:

Possible Causes and Solutions

1. Timeout Configuration Settings

In Pega 24.1.2, the stream delivery timeout and request timeout are configurable values. The TimeoutException may be occurring because the default settings are not sufficient for your environment.

Try adding these settings to your prconfig.xml:

<env name="services/stream/producer/delivery/timeout" value="120000"/>
<env name="services/stream/request/timeout" value="30000"/>

The delivery timeout default is 2 minutes (120000ms), and the request timeout default is 30 seconds (30000ms). You might need to increase these values, but always ensure that the delivery timeout is greater than the request timeout.
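
The ordering rule above (delivery timeout greater than request timeout) can be checked before restarting a node. The following is a minimal Python sketch, not a Pega tool: it assumes prconfig.xml is a flat XML file with a `<pegarules>` root containing `<env name="..." value="..."/>` elements, and it reuses the setting names and defaults quoted in this answer, which themselves still need validating against the official documentation. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Setting names and defaults as quoted in the answer above (unverified).
DELIVERY_KEY = "services/stream/producer/delivery/timeout"
REQUEST_KEY = "services/stream/request/timeout"


def read_env_settings(prconfig_xml):
    """Return {name: value} for every <env> element in a prconfig.xml string."""
    root = ET.fromstring(prconfig_xml)
    return {env.get("name"): env.get("value") for env in root.iter("env")}


def delivery_exceeds_request(settings, default_delivery_ms=120000,
                             default_request_ms=30000):
    """True when the delivery timeout is strictly greater than the request timeout.

    Missing settings fall back to the defaults stated in the answer above.
    """
    delivery_ms = int(settings.get(DELIVERY_KEY, default_delivery_ms))
    request_ms = int(settings.get(REQUEST_KEY, default_request_ms))
    return delivery_ms > request_ms
```

To use it, read the node's prconfig.xml into a string, pass it to `read_env_settings`, and hand the resulting dictionary to `delivery_exceeds_request`. A `False` result means the two values should be revisited.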

2. Node Network Region and Latency

The failing node might be experiencing network latency when communicating with Kafka. Ensure that the failing web node is in the same network region as the other nodes to avoid latency problems that could cause timeouts.

3. Kafka Service Configuration

The TimeoutException could indicate that the Kafka service isn’t properly configured or accessible from the failing web node. Check that:

  • Kafka is properly installed and running
  • The ports required for Kafka (9092) and Stream REST service (7003) are open and accessible
  • No firewall or network issues are preventing communication
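
One quick way to test the port checks above is to attempt a plain TCP connection from the failing web node to the stream node. This is a small Python sketch using only the standard library; the port numbers are the ones listed above, and the function names and the idea of running it on the web node are assumptions for illustration. A successful connection only shows that the port is reachable, not that Kafka itself is healthy.

```python
import socket


def is_port_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_stream_ports(host, ports=(9092, 7003)):
    """Map each port to its reachability from the machine running this script."""
    return {port: is_port_open(host, port) for port in ports}
```

Run `check_stream_ports("<stream node address>")` on both a working node and the failing node and compare the results. A port that is open from one and closed from the other points at a firewall or routing difference.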

4. JVM and System Resources

The issue might be related to insufficient resources on the failing node:

  • Check if the JVM has adequate memory allocation
  • Ensure the system’s ulimit settings are properly configured (if running on Linux/Unix)
  • Verify that the node has sufficient CPU resources available

5. Clock Synchronization

Ensure that all nodes in your cluster have synchronized clocks. The Stream service requires clocks to be synchronized within a 30-second window, preferably using NTP.
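
If you can capture the current time on each node at roughly the same moment (for example by running a date command on every host in quick succession), the skew can be compared against the 30-second window mentioned above. The Python sketch below assumes you have already collected those readings as timezone-aware timestamps; how they are collected, the node names, and the function names are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def max_clock_skew(node_times):
    """Return the largest difference between any two node clocks as a timedelta."""
    times = list(node_times.values())
    return max(times) - min(times)


def clocks_within_window(node_times, window_s=30):
    """True when every pair of node clocks differs by at most window_s seconds."""
    return max_clock_skew(node_times) <= timedelta(seconds=window_s)


# Illustrative readings only; replace with values captured from your nodes.
sample_times = {
    "stream": datetime(2025, 3, 20, 16, 45, 17, tzinfo=timezone.utc),
    "web1": datetime(2025, 3, 20, 16, 45, 20, tzinfo=timezone.utc),
    "web2": datetime(2025, 3, 20, 16, 46, 5, tzinfo=timezone.utc),
}
```

Because the readings are never taken at exactly the same instant, treat a result near the limit as inconclusive and rely on NTP status on each host for the final word.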

6. Stream Node Type Configuration

Since you mentioned this is a cluster with 4 nodes, verify that your stream node type is correctly configured:

<env name="dsm/services/stream/pyBrokerAddress" value="[IP_ADDRESS]"/>
<env name="dsm/services/stream/pyBrokerPort" value="9092"/>

7. Kafka Process Management

Sometimes, the Kafka Java process might not have shut down properly during previous operations. Check whether any Kafka processes are still running after the server is shut down, and ensure they are properly terminated before restarting.

8. Provider Configuration

Although you mentioned you’ve already tried setting <env name="services/stream/provider" value="Internal"/>, you might want to verify if this is the correct setting for your environment. In some cases, especially after an upgrade to 24.1.2, you may need to use:

<env name="services/stream/provider" value="KafkaService"/>
<env name="services/stream/name/pattern" value="[isolation-id]-{stream.name}"/>
<env name="dsm/services/stream/clustername" value="[cluster-name]"/>

Next Steps

  1. Start by examining the JVM arguments and settings for the failing node, ensuring they match those on the working nodes
  2. Adjust the timeout settings mentioned above
  3. Verify network connectivity between the failing node and Kafka
  4. Check system resource availability
  5. If none of these steps help, consider reinstalling the Stream service on the affected node
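
For step 1, comparing the JVM arguments of the failing node with those of a working node is easier with a small diff than by eye. This is a minimal Python sketch: it assumes you have copied each node's JVM argument string (for example from the Tomcat startup script) into a plain string, and the function name and the sample arguments in the usage note are made up for illustration.

```python
import shlex


def diff_jvm_args(failing_args, working_args):
    """Return (only_on_failing, only_on_working) as sorted lists of arguments.

    Arguments are split the way a POSIX shell would split them, so quoted
    values containing spaces stay together.
    """
    failing = set(shlex.split(failing_args))
    working = set(shlex.split(working_args))
    return sorted(failing - working), sorted(working - failing)
```

For example, `diff_jvm_args("-Xmx4g -DNodeType=WebUser", "-Xmx8g -DNodeType=WebUser")` returns `(["-Xmx4g"], ["-Xmx8g"])`, showing that only the heap size differs between the two hypothetical nodes.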

@MarijeSchillern Thanks for the update. I tried setting the timeout value, but it didn’t work.

I don’t think it’s a configuration issue, because the same cluster settings worked in 8.8.1, where all nodes joined the cluster and communicated with each other. As part of the upgrade to 24.1.2, we didn’t make any changes to prconfig.xml or to the JVM arguments of the Tomcat instance.

After the in-place upgrade, we just cleared the temp directories and the pr_data_stream tables, replaced the old prweb WAR file with the new prweb.war, and restarted the application servers. Three nodes joined the cluster successfully, but only one of the two web nodes is among them.

I will let the GCS expert investigate it.

@Brahmesh The support ticket was closed with the following info:

Resolution reason: Explanation provided
Solution type: Local Change - Implementation
Solution type description:

During our meeting, we accomplished the following:

- Added a new setting to prconfig.xml for the non-functional web node, which disabled the unnecessary validation with the stream node that was causing issues for the web node.
- Deleted the Kafka folder.
- Truncated the following tables:
  - pr_data_stream_nodes
  - pr_data_stream_sessions
  - pr_data_stream_node_updates
  - pr_data_qp_run_partition

After executing these actions, the web node has been successfully initialized upon restart.
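
For anyone repeating the table cleanup, the statements can be generated for review rather than typed by hand. This is only a Python sketch that prints SQL text and does not connect to any database: the table names come from the ticket summary above, while the schema name is a placeholder you must supply, and the function name is hypothetical. Have a DBA review the output before anything is executed.

```python
# Table names as listed in the ticket summary above.
STREAM_TABLES = (
    "pr_data_stream_nodes",
    "pr_data_stream_sessions",
    "pr_data_stream_node_updates",
    "pr_data_qp_run_partition",
)


def build_truncate_statements(schema):
    """Return one TRUNCATE statement per table for the given data schema.

    The schema name is validated as a simple identifier so that arbitrary
    text cannot be spliced into the generated SQL.
    """
    if not schema.isidentifier():
        raise ValueError("schema must be a simple identifier: %r" % schema)
    return ["TRUNCATE TABLE {}.{};".format(schema, table) for table in STREAM_TABLES]
```

Calling `build_truncate_statements("your_data_schema")` returns four statements, one per table, that can be pasted into a SQL client once they have been checked against your environment.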