Stream service unavailability throws repeated exceptions, filling disk space

Hi

We are on Pega 8.8.1 with a VM-based deployment: one node is the Stream node, one is Background + Search, one more is Background, and three are WebUser nodes. This is our STG environment.

We had a situation where the Stream node went out of service. The resulting repeated exceptions in the logs brought down the Stream node as well as the two other nodes (BG + Search, BG).

I reproduced this by stopping the stream service from Dev Studio.

We understand that the stream service is required and should be running for the OOTB platform queue processors (QPs). But the QPs are associated with the Background nodes. We have just upgraded, so nothing in our application uses the stream service via a QP or a data flow; it is only the OOTB platform functionality.

The same exception is repeated and fills the log file, and thereby the disk, in about 30 minutes.

Given that this affects the Background nodes, is there a way to avoid these repeated exceptions in the logs?

We have configured a notification in PDC for this, which seemed to work fine, but a stream service outage adversely affecting the whole cluster does not seem right.

Is there a config or setting to avoid this?

2023-06-07 12:23:31,011 [OBSCHEDULER_THREAD_3] [ STANDARD] (ubscriber.GenericStreamHandler) ERROR BMICN88SZ10B7V7TM3GVM7JOHTLOA7KI5A - Unable to poll the stream
java.util.concurrent.CompletionException: com.pega.fnx.stream.spi.StreamServiceException: Invalid configuration. Undefined stream provider end point.
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332) ~[?:?]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347) ~[?:?]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:874) ~[?:?]
at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:887) ~[?:?]
at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2325) ~[?:?]
at com.pega.fnx.stream.spi.impl.metric.StreamSPIMetricsProxy.execute(StreamSPIMetricsProxy.java:44) ~[?:?]
at com.pega.fnx.stream.spi.impl.StreamSP.execute(StreamSP.java:61) ~[?:?]
at com.pega.fnx.stream.spi.loader.SPIProxy.execute(SPIProxy.java:43) ~[stream-api-3.0.12-20221118134502.jar:?]
at com.pega.platform.stream.StreamSPIStateValidator.execute(StreamSPIStateValidator.java:67) ~[stream.jar:?]
at com.pega.platform.stream.StreamSPIRequestValidator.execute(StreamSPIRequestValidator.java:57) ~[stream.jar:?]
at com.pega.fnx.stream.api.PollingSubscriber.poll(PollingSubscriber.java:85) ~[stream-api-3.0.12-20221118134502.jar:?]
at com.pega.decisionmonitoring.stream.subscriber.GenericStreamHandler.consumeFromStream(GenericStreamHandler.java:76) ~[monitoring-stream.jar:?]
at com.pega.decisionmonitoring.stream.subscriber.GenericStreamHandler.subscribeToNamedStream(GenericStreamHandler.java:106) ~[monitoring-stream.jar:?]
at com.pegarules.generated.activity.ra_action_pypollstream_01c25dd283b96e0d04796680e007b9bd.step2_circum0(ra_action_pypollstream_01c25dd283b96e0d04796680e007b9bd.java:258) ~[?:?]
at com.pegarules.generated.activity.ra_action_pypollstream_01c25dd283b96e0d04796680e007b9bd.perform(ra_action_pypollstream_01c25dd283b96e0d04796680e007b9bd.java:93) ~[?:?]
at com.pega.pegarules.session.internal.mgmt.Executable.doActivity(Executable.java:2872) ~[prprivate-session.jar:?]
at com.pega.platform.executor.jobscheduler.internal.ActivityExecutor.runActivity(ActivityExecutor.java:59) ~[pega-executor.jar:?]
at com.pega.platform.executor.jobscheduler.internal.ActivityExecutor.executeActivity(ActivityExecutor.java:51) ~[pega-executor.jar:?]
at com.pega.platform.executor.jobscheduler.internal.ActivityProcessor.executeActivity(ActivityProcessor.java:73) ~[pega-executor.jar:?]
at com.pega.platform.executor.jobscheduler.internal.ActivityProcessor.execute(ActivityProcessor.java:59) ~[pega-executor.jar:?]
at com.pega.platform.executor.jobscheduler.internal.ActivityProcessor.run(ActivityProcessor.java:110) ~[pega-executor.jar:?]
at com.pega.pegarules.session.internal.PRSessionProviderImpl.performTargetActionWithLock(PRSessionProviderImpl.java:1381) ~[prprivate-session.jar:?]
at com.pega.pegarules.session.internal.PRSessionProviderImpl.doWithRequestorLocked(PRSessionProviderImpl.java:1124) ~[prprivate-session.jar:?]
at com.pega.pegarules.session.internal.PRSessionProviderImpl.doWithRequestorLocked(PRSessionProviderImpl.java:1005) ~[prprivate-session.jar:?]
at com.pega.pegarules.session.internal.PRSessionProviderImplForModules.doWithRequestorLocked(PRSessionProviderImplForModules.java:83) ~[prprivate-session.jar:?]
at com.pega.platform.executor.jobscheduler.internal.ActivityProcessor.run(ActivityProcessor.java:101) ~[pega-executor.jar:?]
at com.pega.platform.executor.jobscheduler.internal.JobSchedulerProcessor.execute(JobSchedulerProcessor.java:110) ~[pega-executor.jar:?]
at com.pega.platform.executor.jobscheduler.internal.JobSchedulerProcessor.lambda$execute$0(JobSchedulerProcessor.java:92) ~[pega-executor.jar:?]
at com.pega.platform.executor.internal.LogContextDecorator.runInDecoratedScope(LogContextDecorator.java:38) ~[pega-executor.jar:?]
at com.pega.platform.executor.jobscheduler.internal.JobSchedulerProcessor.execute(JobSchedulerProcessor.java:90) ~[pega-executor.jar:?]
at com.pega.platform.executor.jobscheduler.scheduler.internal.JobRunTimeImpl.execute(JobRunTimeImpl.java:104) ~[executor.jar:?]
at com.pega.platform.executor.jobscheduler.scheduler.internal.JobRunTimeDecorator.execute(JobRunTimeDecorator.java:57) ~[executor.jar:?]
at com.pega.platform.executor.jobscheduler.scheduler.internal.JobExecutionTemplate.executeJob(JobExecutionTemplate.java:56) ~[executor.jar:?]
at com.pega.platform.executor.jobscheduler.scheduler.internal.JobExecutionTemplate.run(JobExecutionTemplate.java:46) ~[executor.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: com.pega.fnx.stream.spi.StreamServiceException: Invalid configuration. Undefined stream provider end point.
at com.pega.fnx.stream.spi.impl.kafka.KafkaAdminSettings.getBootstrapServers(KafkaAdminSettings.java:75) ~[?:?]
at com.pega.fnx.stream.spi.impl.kafka.KafkaSettingsProvider.applyCommonProperties(KafkaSettingsProvider.java:77) ~[?:?]
at com.pega.fnx.stream.spi.impl.kafka.KafkaSettingsProvider.getConsumerProperties(KafkaSettingsProvider.java:55) ~[?:?]
at com.pega.fnx.stream.spi.impl.kafka.KafkaServerImpl.getConsumer(KafkaServerImpl.java:52) ~[?:?]
at com.pega.fnx.stream.spi.impl.processor.SubscriberRequestProcessor.getConsumer(SubscriberRequestProcessor.java:47) ~[?:?]
at com.pega.fnx.stream.spi.impl.processor.GetMessageRequestProcessor.pullRecords(GetMessageRequestProcessor.java:92) ~[?:?]
at com.pega.fnx.stream.spi.impl.processor.GetMessageRequestProcessor.execute(GetMessageRequestProcessor.java:73) ~[?:?]
at com.pega.fnx.stream.spi.impl.processor.GetMessageRequestProcessor.execute(GetMessageRequestProcessor.java:50) ~[?:?]
at com.pega.fnx.stream.spi.impl.StreamSPNative.execute(StreamSPNative.java:71) ~[?:?]
at com.pega.fnx.stream.spi.impl.kafka.settings.StreamKafkaProxy.execute(StreamKafkaProxy.java:49) ~[?:?]

@Tanul_Thanvi My advice is to try Patch Release 8.8.2 with your STG env

Update: I just noticed that Patch Release 8.8.3 is now available

@shanp Thanks, shanp. We did try, but there was a bug in 8.8.2 that is confirmed to be fixed in 8.8.3, so we had to roll back to 8.8.1.

@Tanul_Thanvi We figured out from the logs that a job scheduler is causing the repeated exceptions: pyAggregateAndStoreDecisionResults calls pyPollStream, which has monitoring/logging code running almost every other second. Apparently there is a setting in Prediction Studio, 'Monitor model input and output data', to turn this off; after we disabled it, everything seemed fine. We are not using Prediction Studio either, since we have a different app and environment for this.
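
For anyone hitting the same thing, here is a rough sketch of how the offending activity could be spotted in a large log file. It is not a Pega tool, just a small Python script written against the log format pasted above. The function name `summarize` and the two regular expressions are my own assumptions: the header pattern assumes every log event starts with a line shaped like the `2023-06-07 12:23:31,011 [...] [...] (...) ERROR ... - message` line in the original post, and the activity pattern assumes generated activity classes look like `ra_action_<name>_<32 hex chars>` as in the stack trace.

```python
import re
import sys
from collections import Counter

# Assumed header layout, taken from the single log line in this thread:
# timestamp [thread] [pool] (logger) LEVEL requestor - message
HEADER = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} "
    r"\[(?P<thread>[^\]]*)\] \[[^\]]*\] \((?P<logger>[^)]*)\) "
    r"(?P<level>[A-Z]+)\s+.*?- (?P<message>.*)$"
)

# Assumed naming of generated activity classes, as seen in the stack trace.
ACTIVITY = re.compile(
    r"com\.pegarules\.generated\.activity\.ra_action_(\w+?)_[0-9a-f]{32}\."
)


def summarize(lines):
    """Count ERROR events per (logger, message) and per generated activity."""
    errors = Counter()
    activities = Counter()
    in_error = False
    seen = set()  # activities already counted for the current event
    for line in lines:
        header = HEADER.match(line)
        if header:
            # A new log event starts here; remember whether it is an ERROR.
            in_error = header["level"] == "ERROR"
            seen = set()
            if in_error:
                errors[(header["logger"], header["message"])] += 1
            continue
        if in_error:
            # Stack-trace line: count each generated activity once per event.
            for name in ACTIVITY.findall(line):
                if name not in seen:
                    seen.add(name)
                    activities[name] += 1
    return errors, activities


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize_log.py <path to log file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        error_counts, activity_counts = summarize(handle)
    for (logger, message), count in error_counts.most_common(10):
        print(f"{count:8d}  {logger}: {message}")
    for name, count in activity_counts.most_common(10):
        print(f"{count:8d}  activity {name}")
```

Run against a log like the one above, the activity name it reports is the lowercased form from the generated class (for example `pypollstream`), which can then be matched to the job scheduler that calls it.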