Kafka getting restarted multiple times a day

We have a cluster setup where 2 of the 8 PROD nodes are Stream nodes. We consume some of the real-time topics through a pub-sub model.

The issue we are facing is "Process kafka.Kafka restarted". The event details show dt.event.group_label, and nothing much is written in the Kafka logs or the Kafka GC logs.

Has anyone else faced this issue and identified a solution?

Our Pega version is 8.6.1, running on IBM WebSphere 9.x.

Note:

Multiple Pega Sev-1/Sev-2 incidents were opened. They helped bring the JVMs back to life, but did not identify the root cause.

Here is the latest incident, INC-212980, for triage.

@SUMAN_GUMUDAVELLY

The solution involved multiple configuration changes:

1. JVM Argument Change

Before:
-Xgcpolicy:balanced

After:
-Xgcpolicy:gencon -Xmn4096m -Xdump:system:none -Xdisableexplicitgc

&&

2. Added a DSS: Pega-Engine • prconfig/dsm/services/stream/pyheapoptions/default
Value: -Xmx4G -Xms4G

&&

3. Updated the ulimit open files setting for the JVM as well as the Kafka instance [default 1024, changed to 10000]

@:~> ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 128484
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 10000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 128484
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
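
Note that `ulimit -a` reports the limit of the shell it is run in, which is not necessarily what an already running JVM or Kafka process received. On Linux, the limits actually applied to a process can be read from /proc/<pid>/limits. Below is a minimal sketch of such a check. The helper names and the command-line usage are hypothetical, and the 10000 threshold is simply the value used in this post.

```python
import re
import sys

REQUIRED_OPEN_FILES = 10000  # value from this post; adjust for your environment


def parse_open_files_limit(limits_text):
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text.

    A limit reported as 'unlimited' is returned as None.
    """
    for line in limits_text.splitlines():
        match = re.match(r"Max open files\s+(\S+)\s+(\S+)", line)
        if match:
            return tuple(
                None if value == "unlimited" else int(value)
                for value in match.groups()
            )
    raise ValueError("no 'Max open files' line found")


def is_sufficient(soft_limit, required=REQUIRED_OPEN_FILES):
    """True if the soft limit is unlimited or at least the required value."""
    return soft_limit is None or soft_limit >= required


def process_open_files_limit(pid):
    """Read the open files limit of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_open_files_limit(handle.read())


if __name__ == "__main__":
    # Hypothetical usage: pass the PIDs of the WebSphere JVM and Kafka process.
    for pid in sys.argv[1:]:
        soft, hard = process_open_files_limit(pid)
        status = "OK" if is_sufficient(soft) else "TOO LOW"
        print(f"pid {pid}: soft={soft} hard={hard} -> {status}")
```

If a process still shows the old value, it was most likely started before the limit was raised and needs a restart to pick it up.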

&&

4. Updated WebSphere Connection Pool Settings:
Reap time to 300
Unused timeout to 1800
Aged timeout to 1200
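
As far as I understand IBM's connection pool guidance, these three values are in seconds and the reap time should be lower than both the unused timeout and the aged timeout, so that the pool maintenance thread runs often enough to enforce them. A small sanity check along those lines is sketched below. The function name is hypothetical and this is not a WebSphere API, just plain Python to validate the numbers before applying them.

```python
def check_pool_timeouts(reap_time, unused_timeout, aged_timeout):
    """Return a list of problems for a set of connection pool timeouts (seconds).

    Assumption: a timeout of 0 means that timeout is disabled, so it is skipped.
    """
    problems = []
    for name, value in (
        ("unused timeout", unused_timeout),
        ("aged timeout", aged_timeout),
    ):
        if value and reap_time >= value:
            problems.append(
                f"reap time {reap_time}s is not less than {name} {value}s"
            )
    return problems


if __name__ == "__main__":
    # Values from this post.
    issues = check_pool_timeouts(reap_time=300, unused_timeout=1800, aged_timeout=1200)
    print(issues if issues else "timeouts look consistent")
```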

&&

5. SQL Server Settings:

Updated Cost Threshold for Parallelism from 50 to 150
Updated Max Degree of Parallelism from 0 to 8