We have a cluster setup where 2 of the 8 PROD nodes are Stream nodes. We are consuming some of the real-time topics through a pub-sub model.
The issue we are facing is the alert “Process kafka.Kafka restarted”. When we look at the details, they show only
dt.event.group_label, and nothing much is written to the Kafka logs or the Kafka GC logs.
Has anyone else faced this issue and identified a solution?
Our Pega version is 8.6.1, and we are running on IBM WebSphere 9.x.
Note:
Multiple Pega Sev-1/Sev-2 incidents were opened. Those incidents helped bring the JVMs back up, but they did not identify the root cause.
Here is the latest, INC-212980, for triage.
@SUMAN_GUMUDAVELLY
The solution involved multiple configuration changes:
1. JVM Argument Change
Before:
-Xgcpolicy:balanced
After:
-Xgcpolicy:gencon -Xmn4096m -Xdump:system:none -Xdisableexplicitgc
&&
2. Added a DSS: Pega-Engine • prconfig/dsm/services/stream/pyheapoptions/default
Value: -Xmx4G -Xms4G
&&
3. Updated the ulimit open files setting for the JVM as well as the Kafka instance [default 1024, changed to 10000]
@:~> ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 128484
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 10000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 128484
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
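One thing worth double-checking: ulimit -a reports the limits of the current shell, which may differ from what an already running JVM or Kafka process actually has. On Linux the effective values are exposed in /proc/<pid>/limits. The following is only a rough sketch of how that could be checked; the function names are made up for illustration and are not part of Pega, Kafka, or WebSphere.

```python
from pathlib import Path


def _to_int(value):
    # The kernel prints "unlimited" when no cap is set; map that to None.
    return None if value == "unlimited" else int(value)


def parse_open_files_limit(limits_text):
    """Return (soft, hard) for the 'Max open files' row of /proc/<pid>/limits."""
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            # Remaining columns are: soft limit, hard limit, units.
            fields = line[len(prefix):].split()
            return _to_int(fields[0]), _to_int(fields[1])
    raise ValueError("no 'Max open files' line found")


def open_files_limit_for_pid(pid):
    """Read the limits of a running process (Linux only; hypothetical helper)."""
    return parse_open_files_limit(Path(f"/proc/{pid}/limits").read_text())
```

Calling open_files_limit_for_pid with the PID of the Kafka process (and again with the PID of the WebSphere JVM) should return (10000, 10000) if the new limit was picked up after the restart.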
&&
4. Updated WebSphere Connection Pool Settings:
Reap time to 300
Unused timeout to 1800
Aged timeout to 1200
&&
5. SQL Server Settings:
Updated Cost Threshold for Parallelism from 50 to 150
Updated Max Degree of Parallelism from 0 to 8
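
For anyone who wants to script the SQL Server change rather than use the server properties dialog, both options can be set through sp_configure once 'show advanced options' is enabled. Below is a small sketch that only prints the T-SQL for review; the helper name and the dictionary are illustrative, and the statements should be checked by a DBA before they are run.

```python
def sp_configure_statements(settings):
    """Render T-SQL that applies server-level options via sp_configure."""
    # Both parallelism options are advanced options, so enable those first.
    lines = ["EXEC sp_configure 'show advanced options', 1;", "RECONFIGURE;"]
    for name, value in settings.items():
        lines.append(f"EXEC sp_configure '{name}', {int(value)};")
    # RECONFIGURE makes the new values take effect.
    lines.append("RECONFIGURE;")
    return "\n".join(lines)


# Values taken from the post above.
PARALLELISM_SETTINGS = {
    "cost threshold for parallelism": 150,
    "max degree of parallelism": 8,
}

print(sp_configure_statements(PARALLELISM_SETTINGS))
```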