Problem starting JBoss on Pega server

Hi, we have a problem starting the third instance of JBoss on the Pega application server. The error that occurs is:

19:06:44,487 WARNING [com.hazelcast.spi.impl.BasicInvocation] (hz._hzInstance_1_668e03f97c14a138d2c4267ce8ce03b2.response) [xx.xx.xx.xxx]:5702 [668e03f97c14a138d2c4267ce8ce03b2] [3.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.map.operation.MapKeySetOperation@7c9fb2eb, partitionId=130, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=240, callTimeout=60000, target=Address[xx.xx.xx.xxx]:5701}, Reason: com.hazelcast.spi.exception.PartitionMigrating
Exception: Partition is migrating! this:Address[xx.xx.xx.xxx]:5701, partitionId: 130, operation: com.hazelcast.map.operation.MapKeySetOperation, service: hz:impl:mapService
19:06:49,531 STDERR [stderr] (asxxxxxx.intranet.fw) com.hazelcast.spi.exception.PartitionMigratingException: Partition is migrating! this:Address[xx.xx.xx.xxx]:5701, partitionId: 130, operation: com.hazelcast.map.operation.MapKeySetOperation, service: hz:impl:mapService
19:06:49,531 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.spi.impl.BasicOperationService.processOperation(BasicOperationService.java:344)
19:06:49,531 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.spi.impl.BasicOperationService.processPacket(BasicOperationService.java:309)
19:06:49,532 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.spi.impl.BasicOperationService.access$400(BasicOperationService.java:102)
19:06:49,532 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.spi.impl.BasicOperationService$BasicOperationProcessorImpl.process(BasicOperationService.java:756)
19:06:49,532 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.spi.impl.BasicOperationScheduler$PartitionThread.process(BasicOperationScheduler.java:276)
19:06:49,532 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.spi.impl.BasicOperationScheduler$PartitionThread.doRun(BasicOperationScheduler.java:270)
19:06:49,533 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.spi.impl.BasicOperationScheduler$PartitionThread.run(BasicOperationScheduler.java:245)
19:06:49,533 STDERR [stderr] (asxxxxxx.intranet.fw) at ------ End remote and begin local stack-trace ------.(Unknown Source)
19:06:49,533 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.resolveResponse(BasicInvocation.java:836)
19:06:49,534 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.resolveResponseOrThrowException(BasicInvocation.java:769)
19:06:49,534 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:696)
19:06:49,534 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:674)
19:06:49,534 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.spi.impl.BasicOperationService.invokeOnPartitions(BasicOperationService.java:613)
19:06:49,535 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.spi.impl.BasicOperationService.invokeOnAllPartitions(BasicOperationService.java:549)
19:06:49,535 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.map.proxy.MapProxySupport.keySetInternal(MapProxySupport.java:573)
19:06:49,535 STDERR [stderr] (asxxxxxx.intranet.fw) at com.hazelcast.map.proxy.MapProxyImpl.keySet(MapProxyImpl.java:479)
19:06:49,535 STDERR [stderr] (asxxxxxx.intranet.fw) at com.pega.pegarules.cluster.internal.PRClusterHazelcastImpl.checkMembershipConsistency(PRClusterHazelcastImpl.java:418)
19:06:49,536 STDERR [stderr] (asxxxxxx.intranet.fw) at com.pega.pegarules.session.internal.mgmt.PRNodeImpl.checkClusterConsistency(PRNodeImpl.java:2397)
19:06:49,536 STDERR [stderr] (asxxxxxx.intranet.fw) at com.pega.pegarules.session.internal.mgmt.PREnvironment.getThreadAndInitialize(PREnvironment.java:374)
19:06:49,536 STDERR [stderr] (asxxxxxx.intranet.fw) at com.pega.pegarules.session.internal.PRSessionProviderImpl.getThreadAndInitialize(PRSessionProviderImpl.java:1905)
19:06:49,536 STDERR [stderr] (asxxxxxx.intranet.fw) at com.pega.pegarules.session.internal.engineinterface.etier.impl.EngineStartup.initEngine(EngineStartup.java:657)
19:06:49,537 STDERR [stderr] (asxxxxxx.intranet.fw) at com.pega.pegarules.session.internal.engineinterface.etier.impl.EngineImpl._initEngine_privact(EngineImpl.java:165)
19:06:49,537 STDERR [stderr] (asxxxxxx.intranet.fw) at com.pega.pegarules.session.internal.engineinterface.etier.impl.EngineImpl.doStartup(EngineImpl.java:138)
19:06:49,537 STDERR [stderr] (asxxxxxx.intranet.fw) at com.pega.pegarules.web.servlet.WebAppLifeCycleListener._contextInitialized_privact(WebAppLifeCycleListener.java:280)
19:06:49,538 STDERR [stderr] (asxxxxxx.intranet.fw) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
19:06:49,539 STDERR [stderr] (asxxxxxx.intranet.fw) at java.lang.reflect.Method.invoke(Method.java:606)
19:06:49,539 STDERR [stderr] (asxxxxxx.intranet.fw) at com.pega.pegarules.internal.bootstrap.PRBootstrap.invokeMethod(PRBootstrap.java:338)
19:06:49,539 STDERR [stderr] (asxxxxxx.intranet.fw) at com.pega.pegarules.internal.bootstrap.PRBootstrap.invokeMethodPropagatingThrowable(PRBootstrap.java:379)
19:06:49,539 STDERR [stderr] (asxxxxxx.intranet.fw) at com.pega.pegarules.boot.internal.extbridge.AppServerBridgeToPega.invokeMethodPropagatingThrowable(AppServerBridgeToPega.java:216)
19:06:49,540 STDERR [stderr] (asxxxxxx.intranet.fw) at com.pega.pegarules.boot.internal.extbridge.AppServerBridgeToPega.invokeMethod(AppServerBridgeToPega.java:265)
19:06:49,540 STDERR [stderr] (asxxxxxx.intranet.fw) at com.pega.pegarules.internal.web.servlet.WebAppLifeCycleListenerBoot.contextInitialized(WebAppLifeCycleListenerBoot.java:83)
19:06:49,540 STDERR [stderr] (asxxxxxx.intranet.fw) at org.apache.catalina.core.StandardContext.contextListenerStart(StandardContext.java:3339)
19:06:49,540 STDERR [stderr] (asxxxxxx.intranet.fw) at org.apache.catalina.core.StandardContext.start(StandardContext.java:3780)
19:06:49,541 STDERR [stderr] (asxxxxxx.intranet.fw) at org.jboss.as.web.deployment.WebDeploymentService.doStart(WebDeploymentService.java:163)
19:06:49,541 STDERR [stderr] (asxxxxxx.intranet.fw) at org.jboss.as.web.deployment.WebDeploymentService.access$000(WebDeploymentService.java:61)
19:06:49,541 STDERR [stderr] (asxxxxxx.intranet.fw) at org.jboss.as.web.deployment.WebDeploymentService$1.run(WebDeploymentService.java:96)
19:06:49,542 STDERR [stderr] (asxxxxxx.intranet.fw) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
19:06:49,542 STDERR [stderr] (asxxxxxx.intranet.fw) at java.util.concurrent.FutureTask.run(FutureTask.java:262)
19:06:49,542 STDERR [stderr] (asxxxxxx.intranet.fw) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
19:06:49,542 STDERR [stderr] (asxxxxxx.intranet.fw) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
19:06:49,543 STDERR [stderr] (asxxxxxx.intranet.fw) at java.lang.Thread.run(Thread.java:745)
19:06:49,543 STDERR [stderr] (asxxxxxx.intranet.fw) at org.jboss.threads.JBossThread.run(JBossThread.java:122)
19:06:49,762 INFO [org.jboss.as.server] (Controller Boot Thread) JBAS015859: Deployed "SFDCGATEWAY.war" (runtime-name : "SFDCGATEWAY.war")
19:06:49,762 INFO [org.jboss.as.server] (Controller Boot Thread) JBAS015859: Deployed "prgateway.war" (runtime-name : "prgateway.war")
19:06:49,762 INFO [org.jboss.as.server] (ServerService Thread Pool -- 34) JBAS015859: Deployed "prweb.war" (runtime-name : "prweb.war")
19:06:49,763 INFO [org.jboss.as.server] (ServerService Thread Pool -- 34) JBAS015859: Deployed "prhelp.war" (runtime-name : "prhelp.war")
19:06:49,772 INFO [org.jboss.as] (Controller Boot Thread) JBAS015961: Http management interface listening on http://xx.xx.xx.xxx:9990/management
19:06:49,772 INFO [org.jboss.as] (Controller Boot Thread) JBAS015951: Admin console listening on http://xx.xx.xx.xxx:9990
19:06:49,773 INFO [org.jboss.as] (Controller Boot Thread) JBAS015874: JBoss EAP 6.4.21.GA (AS 7.5.21.Final-redhat-1) started in 351711ms - Started 533 of 559 services (90 services are lazy, passive or on-demand)
19:15:51,622 INFO [com.hazelcast.nio.SocketAcceptor] (hz._hzInstance_1_668e03f97c14a138d2c4267ce8ce03b2.IO.thread-Acceptor) [xx.xx.xx.xxx]:5702 [668e03f97c14a138d2c4267ce8ce03b2] [3.2] Accepting socket connection from /xx.xx.xx.xxx:40515
19:15:51,624 INFO [com.hazelcast.nio.TcpIpConnectionManager] (hz._hzInstance_1_668e03f97c14a138d2c4267ce8ce03b2.IO.thread-Acceptor) [xx.xx.xx.xxx]:5702 [668e03f97c14a138d2c4267ce8ce03b2] [3.2] 5702 accepted socket connection from /xx.xx.xx.xxx:40515
19:17:51,954 INFO [com.hazelcast.nio.TcpIpConnection] (hz._hzInstance_1_668e03f97c14a138d2c4267ce8ce03b2.IO.thread-in-2) [xx.xx.xx.xxx]:5702 [668e03f97c14a138d2c4267ce8ce03b2] [3.2] Connection [/xx.xx.xx.xxx:40515] lost. Reason: java.io.EOFException[Remote socket closed!]

@FastwebPegaS, can you please confirm that you logged support ticket INC-235356 for this? This will help the moderators track the issue and follow it through to conclusion.

This type of error can be seen during search initialization. It is a known issue with shard management during rolling restarts. It may have nothing to do with the startup itself, except that search is unlikely to operate in this environment. Shutting down all nodes, emptying the index directory, and reindexing should resolve the search initialization issue. If the problem still appears to be search-related, please share the thread dumps from the node startup with our support team.
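The "empty the index directory" step above can be scripted. A minimal sketch, assuming you substitute your environment's actual search index path (the demonstration below deliberately runs against a throwaway temp directory, not a real Pega index):

```python
import shutil
import tempfile
from pathlib import Path

def clear_index_directory(index_dir: str) -> int:
    """Remove everything inside the search index directory (but keep the
    directory itself) so the index is rebuilt on the next reindex.
    Returns the number of top-level entries removed."""
    root = Path(index_dir)
    removed = 0
    for entry in root.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed

# Demonstration against a throwaway directory; the real path is whatever
# your Pega search index directory is configured to be.
demo = Path(tempfile.mkdtemp())
(demo / "segments_1").write_text("fake index segment")
(demo / "0").mkdir()
print(clear_index_directory(str(demo)))  # number of entries removed
```

Run this only with all nodes down, as described above, and only against the index directory itself.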

The best course of action is to wait for our support team to analyse your logs and help you further.

@FastwebPegaS this issue cannot be processed any further by our support team due to lack of response.

The analysis of your issue is as follows.

The logs show the following errors/exceptions:

Caused by: com.hazelcast.spi.exception.PartitionMigratingException: Partition is migrating! this:Address[WIPED DATA]:5701, partitionId: 3, operation: com.hazelcast.map.operation.MapKeySetOperation, service: hz:impl:mapService

Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=GetOperation{}, partitionId=45, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=110, callTimeout=60000, target=null}, Reason: com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[WIPED DATA]:5702, target:null,

partitionId: 45, replicaIndex: 0, operation: com.hazelcast.map.operation.GetOperation, service: hz:impl:mapService

  • Based on the above stack trace, the target is null and the partition is in flux. Specifically, when the target is null, this message means that this particular member does not have an owner set for that partition. The member's partition table was not updated in time: a request was made before the member was informed where the data lives in the grid.

What it means: In a healthy cluster, this should rarely occur, as Hazelcast has delivered fixes in past releases that prevent the race condition between looking up data and receiving the updated partition table. In a split-brain situation, when the cluster fractures into several smaller clusters, partitions are lost (since some partitions may have existed only on nodes that are no longer part of a splintered group of nodes).

Frequent fracturing and merging also causes the partition tables to receive delayed updates.
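The tryCount=250 / tryPauseMillis=500 values in the log above are Hazelcast's own retry budget for exactly this situation: the invocation is re-attempted until the partition table settles. A rough sketch of that retry loop, with a simulated operation standing in for the real MapKeySetOperation (names here are illustrative, not Hazelcast's internals):

```python
import time

class PartitionMigratingError(Exception):
    """Stand-in for Hazelcast's PartitionMigratingException."""

def invoke_with_retry(operation, try_count=250, try_pause_millis=500):
    """Retry an operation while the partition is migrating, roughly the
    way the log's tryCount/tryPauseMillis suggest Hazelcast does.
    Raises the last error if every attempt fails."""
    last_error = None
    for _attempt in range(try_count):
        try:
            return operation()
        except PartitionMigratingError as err:
            last_error = err
            time.sleep(try_pause_millis / 1000.0)
    raise last_error

# Simulated operation that fails twice (migration in progress) and then
# succeeds once the partition table has settled.
attempts = {"n": 0}
def fake_map_key_set():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise PartitionMigratingError("Partition is migrating!")
    return {"key-1", "key-2"}

print(invoke_with_retry(fake_map_key_set, try_pause_millis=1))
```

This is why a handful of "Retrying invocation" warnings during a migration is normal; the invocation only fails outright when the retry budget is exhausted.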

What to do:

In a healthy cluster, this should be a one-off and can safely be ignored. If the error is seen repeatedly, it may indicate that the cluster is experiencing fracturing. Also check that the Hazelcast ports (5701 and 5702 in the logs above) are not blocked by your firewall.
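A quick way to verify the firewall point is to probe the Hazelcast ports from one member to another. A minimal sketch (the host placeholder mirrors the masked address in the logs; substitute a real node address):

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe the Hazelcast ports seen in the logs (5701/5702) on a cluster
# member; replace "xx.xx.xx.xxx" with the real node address.
for port in (5701, 5702):
    state = "open" if port_reachable("xx.xx.xx.xxx", port) else "blocked/closed"
    print(port, state)
```

Run it from each node toward every other node: a port that is open locally but "blocked/closed" from a peer usually points at the firewall.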

  • You can try the following steps to work around the issue and check whether it helps:
  1. Take a DB backup.
  2. Bring all nodes down.
  3. Truncate the pr_sys_statusnodes table (only after the backup from step 1).
  4. Bring up one node (preferably an index host node).
  5. Bring up the remaining nodes, in parallel or in series, as you wish.
Note: It is recommended to perform the above steps during non-business hours, after taking the DB backup.
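Step 3 above is the only destructive one. As an illustration of what it does, here is a sketch that uses sqlite3 purely as a stand-in for the real Pega rules database (in practice you would run the equivalent TRUNCATE through your DB vendor's own client, after the backup and with all nodes down; the column names below are invented for the demo):

```python
import sqlite3

# Stand-in database: in a real environment this would be the Pega rules
# database, reached through your DB vendor's client tooling.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pr_sys_statusnodes (node_id TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO pr_sys_statusnodes VALUES (?, ?)",
    [("node-a", "STALE"), ("node-b", "STALE"), ("node-c", "STALE")],
)

# Step 3: clear the stale node-status rows. SQLite has no TRUNCATE
# statement, so an unfiltered DELETE plays that role in this demo.
conn.execute("DELETE FROM pr_sys_statusnodes")
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM pr_sys_statusnodes").fetchone()[0]
print(remaining)  # 0 -- table is empty; nodes re-register on startup
```

The point of the step is simply that the table ends up empty, so each node re-registers cleanly when the cluster comes back up.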

If you still see the issue after performing the above steps, support has asked you to provide the following details:

  • Cluster logs and Pega logs
  • prconfig details
  • Hfix scan reports